Chapter 1: The Thesis – LLM as Kernel

Every robot ever shipped follows the same pipeline. A human engineer reasons about the problem, writes code in a programming language designed for human readability, feeds it through a compiler that strips away all that readability, and produces a binary blob that the robot executes without understanding a single instruction. The entire history of robotics firmware is a story of human thought being compressed into machine code through a lossy, irreversible pipeline. The robot never had a mind. It had compiled instructions.
Large language models break this pipeline – but not in the way most people think.
The Wrong Way to Use an LLM in Robotics
The obvious approach is to slot an LLM into the existing pipeline. You keep your ROS
stack, your C++ firmware, your PID controllers, and you add a chatbot layer on top.
The user says “go to the kitchen,” the LLM translates that into a goal coordinate,
and the traditional stack does all the real work. The LLM is a natural-language
frontend. It is a very expensive switch statement.
This approach works, but it wastes most of what makes LLMs interesting. An LLM is not just a text parser. It is a reasoning engine that can hold context, plan across time horizons, explain its decisions, recover from novel failures, and improve through experience. Using it as a glorified command parser is like buying a GPU to run a spreadsheet.
The deeper problem is that the traditional pipeline forces the LLM to speak a language that was never designed for it. Object-oriented programming, function signatures, type systems – these are abstractions invented to help humans manage complexity. When you ask an LLM to generate Python or C++ to control a robot, you are asking it to think in a human’s abstraction layer. It can do it. But it is not native.
The Inversion
LLMos inverts the relationship. Instead of the LLM serving the traditional software stack, the traditional software stack serves the LLM. The language model is not a plugin. It is the kernel – the central decision-making authority that perceives the world, reasons about it, and issues high-level commands that deterministic systems execute.
This inversion operates at two levels.
At development time, a cloud-hosted LLM (Claude Opus 4.6) is the primary developer. It creates agents, writes skill files, reasons about architecture, and evolves robot behaviors. Its native interface is not Python or C++ but markdown – structured text documents that serve as both documentation and executable specifications. The agents it creates live in a volume-based file system, organized by access level (System, Team, User), and communicate through an agent-to-agent messaging protocol. Object-oriented programming is replaced by something closer to how the LLM naturally thinks: structured prose with embedded logic.
At runtime, a local vision-language model (Qwen3-VL-8B) runs the robot’s navigation loop. It receives a structured execution frame every cycle – occupancy grid, symbolic objects, candidate subgoals, camera frames, action history – and returns a JSON decision: where to go, what to do if that fails, and why. This is not free-form text. It is a constrained output schema, a kind of bytecode that the LLM generates natively. The LLM picks strategy. Classical planners execute tactics.
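The frame-in, decision-out loop described above can be sketched in TypeScript. This is an illustrative shape only, assuming field names like `ExecutionFrame` and `NavDecision` – the actual runtime types differ – but it shows the essential point: the model's output is validated against a schema and the frame before anything executes.

```typescript
// Illustrative sketch: these field names are assumptions, not the actual
// runtime types. World state goes in, one constrained JSON decision comes out.
interface ExecutionFrame {
  occupancyGrid: number[][];                      // 0 = free, 1 = occupied, -1 = unknown
  objects: { label: string; x: number; y: number }[];
  candidates: { id: string; x: number; y: number }[];
  actionHistory: string[];
  cameraFrames?: string[];                        // base64 images, if multimodal
}

interface NavDecision {
  target: string;                                 // id of a candidate subgoal
  fallback: string;                               // what to do if the target fails
  reason: string;                                 // one-sentence explanation
}

// Only decisions that pass validation reach execution; free-form prose
// from the model is rejected here, before it can touch the motors.
function parseDecision(raw: string, frame: ExecutionFrame): NavDecision {
  const d = JSON.parse(raw) as NavDecision;
  const knownTarget = frame.candidates.some((c) => c.id === d.target);
  if (!knownTarget || typeof d.reason !== "string") {
    throw new Error("decision failed schema validation");
  }
  return d;
}
```

The validation step is what makes "bytecode" the right word: an output that does not name a real candidate is a malformed instruction, not a creative answer.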
The key insight is that these two levels of inversion map directly to two different types of intelligence:
| | Development (Design-Time) | Runtime (Execution-Time) |

|---|---|---|
| LLM | Claude Opus 4.6 (cloud) | Qwen3-VL-8B (local GPU) |
| Interface | Markdown files, agent specs | JSON execution frames |
| Speed | Seconds to minutes | Milliseconds per cycle |
| Cost | Acceptable (design is rare) | Must be minimal (10Hz loop) |
| Output | Agent definitions, skill files | Navigation decisions |
| Analogy | Architect drawing blueprints | Pilot flying the plane |
The Model-Size Boundary
Not everything should be generated by the LLM at runtime. There is a boundary – call it the model-size boundary – that determines what gets hardcoded in TypeScript and what gets decided by the LLM each cycle.
Below the boundary: A* pathfinding, occupancy grid management, sensor fusion, motor PID control, collision detection. These are fast, deterministic, and well-understood. There is no value in asking an LLM to reinvent A* every cycle.
Above the boundary: goal selection, strategy switching, stuck recovery, exploration planning, world model corrections, multi-step reasoning. These require the kind of flexible, context-dependent reasoning that LLMs excel at.
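What below-the-boundary code looks like can be sketched with a grid search. This is a simplified stand-in – breadth-first search on a 4-connected grid rather than the real A* planner – with the grid encoding (0 = free, 1 = occupied) assumed for illustration:

```typescript
// A minimal below-the-boundary sketch: BFS on a small occupancy grid,
// a stand-in for the real A* planner. 0 = free, 1 = occupied.
// Deterministic and fast; nothing an LLM should reinvent each cycle.
type Cell = [number, number]; // [x, y]

function bfsPath(grid: number[][], start: Cell, goal: Cell): Cell[] | null {
  const key = ([x, y]: Cell) => `${x},${y}`;
  const prev = new Map<string, Cell | null>([[key(start), null]]);
  const queue: Cell[] = [start];
  while (queue.length > 0) {
    const [x, y] = queue.shift()!;
    if (x === goal[0] && y === goal[1]) {
      // Reconstruct the path by walking predecessors back to the start.
      const path: Cell[] = [];
      let cur: Cell | null = [x, y];
      while (cur) { path.unshift(cur); cur = prev.get(key(cur)) ?? null; }
      return path;
    }
    for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
      const nx = x + dx, ny = y + dy;
      if (grid[ny]?.[nx] === 0 && !prev.has(key([nx, ny]))) {
        prev.set(key([nx, ny]), [x, y]);
        queue.push([nx, ny]);
      }
    }
  }
  return null; // goal unreachable
}
```

The point is not the algorithm – it is that nothing here benefits from reasoning. The inputs fully determine the output, so it belongs below the boundary.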
The boundary is not fixed. As smaller models become faster and more capable, it moves
downward – more decisions become feasible to delegate to the LLM. LLMos is designed
for this: the InferenceFunction abstraction means any component can be swapped
between hardcoded logic and LLM inference without changing the surrounding system.
// The boundary in code: InferenceFunction is a plug point
// Below: deterministic. Above: LLM-decided.
export type InferenceFunction = (
  systemPrompt: string,
  userMessage: string,
  images?: string[]
) => Promise<string>;
This type signature, defined in lib/runtime/navigation-loop.ts, is one of the most
important declarations in the codebase. It is the contract between the deterministic
world and the reasoning world. Everything below it – world model serialization,
candidate generation, path planning, motor control – is classical robotics.
Everything above it – goal selection, strategy, recovery, explanation – is LLM territory.
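To make the swappability concrete, here is a hedged sketch of two interchangeable implementations of the contract. Both implementations are illustrative inventions, not code from the repository; the LLM-backed one is reduced to a stub:

```typescript
// Two interchangeable implementations of the InferenceFunction contract.
// The surrounding system does not care which one it is handed, which is
// exactly what lets the model-size boundary move over time.
type InferenceFunction = (
  systemPrompt: string,
  userMessage: string,
  images?: string[]
) => Promise<string>;

// Below the boundary: a hardcoded policy that always picks the nearest
// candidate. Fast, deterministic, no model involved.
const hardcodedPolicy: InferenceFunction = async (_system, frame) => {
  const { candidates } = JSON.parse(frame) as {
    candidates: { id: string; dist: number }[];
  };
  const nearest = candidates.reduce((a, b) => (a.dist <= b.dist ? a : b));
  return JSON.stringify({ target: nearest.id, reason: "nearest candidate" });
};

// Above the boundary: the same contract backed by a model call.
// Stubbed here; a real one would POST the frame to the local VLM server
// and return its JSON string unchanged.
const llmPolicy: InferenceFunction = async (_system, _frame, _images) => {
  return '{"target":"c1","reason":"stub"}';
};
```

Swapping `hardcodedPolicy` for `llmPolicy` is a one-line change at the call site – the definition of moving a decision across the boundary.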
The Operating System Metaphor
The name “LLMos” is not decorative. It reflects a deliberate mapping from traditional operating system concepts to LLM-native equivalents:
| Traditional OS | LLMos Equivalent | Implementation |
|---|---|---|
| Compiled binaries | Markdown files | public/volumes/system/skills/*.md |
| Processes | Agent execution loops | lib/agents/ |
| Users | Humans AND AI physical agents | Dual user model |
| File system | Volume system (System / Team / User) | public/volumes/ |
| Permissions | Volume access + kernel rules | Access control by volume tier |
| System calls | LLM inference requests | InferenceFunction |
| IPC | Agent-to-agent messaging | Message protocol |
| Package manager | Skill promotion pipeline | Skill lifecycle |
| Device drivers | HAL adapters | lib/hal/types.ts |
| Kernel | The LLM itself | Navigation loop + agent system |
In a traditional OS, the kernel mediates between user programs and hardware. In
LLMos, the LLM mediates between high-level goals and physical actuators. The HAL
(lib/hal/types.ts) provides the hardware interface – locomotion, vision,
communication, safety – and the LLM issues commands against it, just as a userspace
program issues system calls.
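The system-call analogy can be sketched as follows. The real interface lives in lib/hal/types.ts; the names and shapes below are assumptions chosen to illustrate the structure, not the actual definitions:

```typescript
// Illustrative sketch of the HAL shape described above. The real interface
// is in lib/hal/types.ts; these names are assumptions for illustration.
interface Locomotion {
  moveTo(x: number, y: number): Promise<void>;
  stop(): Promise<void>;
}
interface Safety {
  emergencyStop(): void;
  isClear(): boolean;
}
interface HAL {
  locomotion: Locomotion;
  safety: Safety;
}

// The LLM kernel issues commands against the HAL the way a userspace
// program issues system calls: it never touches hardware directly,
// and the safety layer runs before any motor command.
async function executeMove(hal: HAL, x: number, y: number): Promise<boolean> {
  if (!hal.safety.isClear()) return false; // safety net fires before motors
  await hal.locomotion.moveTo(x, y);
  return true;
}
```

Because the kernel only ever sees the interface, the same decision logic drives a simulator adapter and a physical-robot adapter unchanged.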
The volume system organizes knowledge by access level. System volumes contain core skills that any agent can read but only the kernel can modify. Team volumes hold shared knowledge for robot fleets. User volumes contain per-robot learned behaviors. This mirrors Unix permission levels, but adapted for a world where the “programs” are markdown files and the “CPU” is an LLM.
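The write rule described above reduces to a small predicate. The tier names match the text; the check function itself is an illustrative assumption, not the repository's implementation:

```typescript
// Illustrative sketch of the volume-tier write rule described above.
// System volumes: any agent may read, only the kernel may write.
// Team and user volumes: writable by agents in scope.
type VolumeTier = "system" | "team" | "user";
type Actor = "kernel" | "agent";

function canWrite(tier: VolumeTier, actor: Actor): boolean {
  if (tier === "system") return actor === "kernel";
  return true;
}
```

This is the markdown-era analogue of Unix write bits: the tier of the file's volume, not the file itself, carries the permission.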
Why This Matters for Physical Robots
A chatbot can afford to be slow. It can afford to hallucinate occasionally. It can afford to give verbose, hedged responses. A physical robot cannot. It is moving through real space where collisions have consequences, battery life is finite, and the environment does not pause while the LLM thinks.
This is why LLMos exists. It is an attempt to answer the question: how do you build a system where an LLM is genuinely in charge of a physical agent – not as a demo, but as an architecture that handles the real constraints of embodied intelligence?
The answer, as the rest of this book will show, involves:
- Structured execution frames that compress the world into what the LLM needs to know, and nothing more (Chapter 3)
- Constrained output schemas that ensure the LLM returns deterministic, executable decisions – not prose (Chapter 4)
- Classical safety nets that catch any LLM failure before it reaches the motors (Chapter 7)
- Dual-model architecture that puts the right LLM in the right loop at the right speed (Chapter 2)
- Predictive intelligence that reduces the LLM’s cognitive load by predicting unseen space before observation (Chapter 9)
None of these ideas are unique to LLMos. But their specific combination – and the fact that they are implemented, tested, and validated end-to-end in a single TypeScript codebase – is what makes this system worth studying.
A Note on Karpathy’s LLM OS
Andrej Karpathy has articulated a vision of the “LLM OS” where the language model serves as the CPU of a new kind of operating system, with tools as peripherals and context windows as working memory. LLMos is an implementation of a subset of this vision, focused specifically on physical agents. The key difference is that LLMos does not treat the physical world as just another tool call. It builds a complete spatial reasoning stack – world models, path planners, sensor fusion – that gives the LLM a structured understanding of physical space, rather than leaving it to reason about coordinates from raw text.
The LLM OS vision is broad. LLMos is narrow and deep. It takes one domain – robot navigation – and builds the full stack from philosophy to motor commands.
Summary
The thesis of LLMos is that large language models should not be accessories to traditional robotics systems. They should be the kernel. The traditional pipeline (human thought compressed into compiled binaries) is replaced by an inverted pipeline where the LLM reasons natively through structured frames and the deterministic stack serves as its execution engine.
This requires two levels of inversion: development-time (markdown as the native program format) and runtime (JSON execution frames as LLM bytecode). It requires a clear boundary between what the LLM decides and what classical systems execute. And it requires an operating system metaphor that maps every traditional OS concept to an LLM-native equivalent.
The rest of this book shows how it works in practice – from the 50x50 occupancy grid through the 13-step navigation loop to the V1 Stepper Cube Robot hardware that brings the thesis off the screen and onto a desk.