Chapter 1: The Thesis – LLM as Kernel

[Figure: A robot at a crossroads between traditional compiled code and glowing LLM text streams]

Every robot ever shipped follows the same pipeline. A human engineer reasons about the problem, writes code in a programming language designed for human readability, feeds it through a compiler that strips away all that readability, and produces a binary blob that the robot executes without understanding a single instruction. The entire history of robotics firmware is a story of human thought being compressed into machine code through a lossy, irreversible pipeline. The robot never had a mind. It had compiled instructions.

Large language models break this pipeline – but not in the way most people think.


The Wrong Way to Use an LLM in Robotics

The obvious approach is to slot an LLM into the existing pipeline. You keep your ROS stack, your C++ firmware, your PID controllers, and you add a chatbot layer on top. The user says “go to the kitchen,” the LLM translates that into a goal coordinate, and the traditional stack does all the real work. The LLM is a natural-language frontend. It is a very expensive switch statement.

This approach works, but it wastes most of what makes LLMs interesting. An LLM is not just a text parser. It is a reasoning engine that can hold context, plan across time horizons, explain its decisions, recover from novel failures, and improve through experience. Using it as a glorified command parser is like buying a GPU to run a spreadsheet.

The deeper problem is that the traditional pipeline forces the LLM to speak a language that was never designed for it. Object-oriented programming, function signatures, type systems – these are abstractions invented to help humans manage complexity. When you ask an LLM to generate Python or C++ to control a robot, you are asking it to think in a human’s abstraction layer. It can do it. But it is not native.


The Inversion

LLMos inverts the relationship. Instead of the LLM serving the traditional software stack, the traditional software stack serves the LLM. The language model is not a plugin. It is the kernel – the central decision-making authority that perceives the world, reasons about it, and issues high-level commands that deterministic systems execute.

This inversion operates at two levels.

At development time, a cloud-hosted LLM (Claude Opus 4.6) is the primary developer. It creates agents, writes skill files, reasons about architecture, and evolves robot behaviors. Its native interface is not Python or C++ but markdown – structured text documents that serve as both documentation and executable specifications. The agents it creates live in a volume-based file system, organized by access level (System, Team, User), and communicate through an agent-to-agent messaging protocol. Object-oriented programming is replaced by something closer to how the LLM naturally thinks: structured prose with embedded logic.

At runtime, a local vision-language model (Qwen3-VL-8B) runs the robot’s navigation loop. It receives a structured execution frame every cycle – occupancy grid, symbolic objects, candidate subgoals, camera frames, action history – and returns a JSON decision: where to go, what to do if that fails, and why. This is not free-form text. It is a constrained output schema, a kind of bytecode that the LLM generates natively. The LLM picks strategy. Classical planners execute tactics.
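Concretely, the shape of this contract can be sketched in TypeScript. The field names below are illustrative assumptions, not the actual LLMos schema – the point is the pattern: a compressed frame in, a validated JSON decision out.

```typescript
// Hypothetical sketch of the runtime contract. Field names are
// illustrative assumptions, not the actual LLMos schema.

// What the LLM sees each cycle: a compressed snapshot of the world.
interface ExecutionFrame {
  cycle: number;
  occupancyGrid: string;                               // serialized 50x50 grid
  objects: { label: string; x: number; y: number }[];  // symbolic objects
  candidates: { id: string; x: number; y: number }[];  // candidate subgoals
  recentActions: string[];                             // action history
}

// What the LLM must return: a constrained decision, not free-form prose.
interface NavDecision {
  targetId: string;          // which candidate subgoal to pursue
  fallbackId: string | null; // what to try if that fails
  reason: string;            // short explanation, kept for logging
}

// The deterministic side validates before anything reaches the planner.
function parseDecision(raw: string, frame: ExecutionFrame): NavDecision | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  const d = parsed as NavDecision;
  if (typeof d.targetId !== "string") return null;
  // Reject targets the classical planner never offered.
  if (!frame.candidates.some((c) => c.id === d.targetId)) return null;
  return d;
}
```

The validation step is what makes the schema "bytecode-like": the LLM can only select from options the classical stack has already vetted.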

The key insight is that these two levels of inversion map directly to two different types of intelligence:

|           | Development (Design-Time)      | Runtime (Execution-Time)    |
|-----------|--------------------------------|-----------------------------|
| LLM       | Claude Opus 4.6 (cloud)        | Qwen3-VL-8B (local GPU)     |
| Interface | Markdown files, agent specs    | JSON execution frames       |
| Speed     | Seconds to minutes             | Milliseconds per cycle      |
| Cost      | Acceptable (design is rare)    | Must be minimal (10Hz loop) |
| Output    | Agent definitions, skill files | Navigation decisions        |
| Analogy   | Architect drawing blueprints   | Pilot flying the plane      |

The Model-Size Boundary

Not everything should be generated by the LLM at runtime. There is a boundary – call it the model-size boundary – that determines what gets hardcoded in TypeScript and what gets decided by the LLM each cycle.

Below the boundary: A* pathfinding, occupancy grid management, sensor fusion, motor PID control, collision detection. These are fast, deterministic, and well-understood. There is no value in asking an LLM to reinvent A* every cycle.

Above the boundary: goal selection, strategy switching, stuck recovery, exploration planning, world model corrections, multi-step reasoning. These require the kind of flexible, context-dependent reasoning that LLMs excel at.

The boundary is not fixed. As smaller models become faster and more capable, it moves downward – more decisions become feasible to delegate to the LLM. LLMos is designed for this: the InferenceFunction abstraction means any component can be swapped between hardcoded logic and LLM inference without changing the surrounding system.

```typescript
// The boundary in code: InferenceFunction is a plug point
// Below: deterministic. Above: LLM-decided.
export type InferenceFunction = (
  systemPrompt: string,
  userMessage: string,
  images?: string[]
) => Promise<string>;
```

This type signature, defined in lib/runtime/navigation-loop.ts, is one of the most important declarations in the codebase. It is the contract between the deterministic world and the reasoning world. Everything below it – world model serialization, candidate generation, path planning, motor control – is classical robotics. Everything above it – goal selection, strategy, recovery, explanation – is LLM territory.
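Because the contract is an ordinary function type, either side of the boundary can satisfy it. A minimal sketch – the stub's picking logic, the endpoint URL, and the payload shape are all assumptions for illustration, not the real inference client:

```typescript
// The same InferenceFunction signature as in navigation-loop.ts.
type InferenceFunction = (
  systemPrompt: string,
  userMessage: string,
  images?: string[]
) => Promise<string>;

// Deterministic stand-in: useful in tests, or for decisions that have
// migrated below the boundary. Always picks the first candidate offered.
// The frame/decision field names here are illustrative assumptions.
const hardcodedPicker: InferenceFunction = async (_system, userMessage) => {
  const frame = JSON.parse(userMessage) as { candidates: { id: string }[] };
  return JSON.stringify({
    targetId: frame.candidates[0].id,
    fallbackId: null,
    reason: "first candidate (stub)",
  });
};

// LLM-backed version: same signature, so callers never change when it is
// swapped in. Endpoint and payload shape are assumptions, not the real client.
const llmPicker: InferenceFunction = async (systemPrompt, userMessage, images) => {
  const res = await fetch("http://localhost:8000/v1/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ system: systemPrompt, user: userMessage, images }),
  });
  return ((await res.json()) as { text: string }).text;
};
```

Swapping `hardcodedPicker` for `llmPicker` moves a decision across the model-size boundary without touching any caller.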


The Operating System Metaphor

The name “LLMos” is not decorative. It reflects a deliberate mapping from traditional operating system concepts to LLM-native equivalents:

| Traditional OS    | LLMos Equivalent                     | Implementation                    |
|-------------------|--------------------------------------|-----------------------------------|
| Compiled binaries | Markdown files                       | public/volumes/system/skills/*.md |
| Processes         | Agent execution loops                | lib/agents/                       |
| Users             | Humans AND AI physical agents        | Dual user model                   |
| File system       | Volume system (System / Team / User) | public/volumes/                   |
| Permissions       | Volume access + kernel rules         | Access control by volume tier     |
| System calls      | LLM inference requests               | InferenceFunction                 |
| IPC               | Agent-to-agent messaging             | Message protocol                  |
| Package manager   | Skill promotion pipeline             | Skill lifecycle                   |
| Device drivers    | HAL adapters                         | lib/hal/types.ts                  |
| Kernel            | The LLM itself                       | Navigation loop + agent system    |

In a traditional OS, the kernel mediates between user programs and hardware. In LLMos, the LLM mediates between high-level goals and physical actuators. The HAL (lib/hal/types.ts) provides the hardware interface – locomotion, vision, communication, safety – and the LLM issues commands against it, just as a userspace program issues system calls.
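The actual interface lives in lib/hal/types.ts; the "system call" pattern it enables can be sketched as follows, with method names that are illustrative assumptions rather than the real HAL:

```typescript
// Illustrative HAL shape -- the actual interface in lib/hal/types.ts
// will differ; this only shows the "system call" pattern.
interface RobotHAL {
  locomotion: {
    moveTo(x: number, y: number): Promise<void>;
    stop(): Promise<void>;
  };
  vision: {
    captureFrame(): Promise<string>; // e.g. a base64-encoded image
  };
  safety: {
    emergencyStop(): void;
  };
}

// The kernel issues commands against the HAL the way a userspace program
// issues system calls -- it never touches motors directly.
async function executeGoal(hal: RobotHAL, x: number, y: number): Promise<boolean> {
  try {
    await hal.locomotion.moveTo(x, y);
    return true;
  } catch {
    // Hardware-level failure: halt before anything else happens.
    hal.safety.emergencyStop();
    return false;
  }
}
```

Because the kernel only ever sees this interface, the same decision logic runs against a simulator, a mock, or real stepper motors.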

The volume system organizes knowledge by access level. System volumes contain core skills that any agent can read but only the kernel can modify. Team volumes hold shared knowledge for robot fleets. User volumes contain per-robot learned behaviors. This mirrors Unix permission levels, but adapted for a world where the “programs” are markdown files and the “CPU” is an LLM.
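That permission rule reduces to a simple tier check. As a sketch – the type names are assumptions, and the writability of Team and User volumes shown here is inferred from the description above, not taken from the real access-control code:

```typescript
// Illustrative volume-permission sketch. Tier names follow the text;
// the enforcement logic is an assumption, not the actual implementation.
type VolumeTier = "system" | "team" | "user";
type Principal = "kernel" | "agent";

// Any principal can read any tier it can see, but only the kernel
// may modify System volumes.
function canWrite(who: Principal, tier: VolumeTier): boolean {
  if (tier === "system") return who === "kernel";
  return true; // team and user volumes accept agent writes in this sketch
}
```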


Why This Matters for Physical Robots

A chatbot can afford to be slow. It can afford to hallucinate occasionally. It can afford to give verbose, hedged responses. A physical robot cannot. It is moving through real space where collisions have consequences, battery life is finite, and the environment does not pause while the LLM thinks.

This is why LLMos exists. It is an attempt to answer the question: how do you build a system where an LLM is genuinely in charge of a physical agent – not as a demo, but as an architecture that handles the real constraints of embodied intelligence?

The answer, as the rest of this book will show, involves:

  • Structured execution frames that compress the world into what the LLM needs to know, and nothing more (Chapter 3)
  • Constrained output schemas that ensure the LLM returns deterministic, executable decisions – not prose (Chapter 4)
  • Classical safety nets that catch any LLM failure before it reaches the motors (Chapter 7)
  • Dual-model architecture that puts the right LLM in the right loop at the right speed (Chapter 2)
  • Predictive intelligence that reduces the LLM’s cognitive load by predicting unseen space before observation (Chapter 9)
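To make the classical-safety-net idea concrete: a deterministic wrapper can time-box inference and substitute a safe stop on any failure. The timeout value, the fallback decision, and the names below are assumptions for illustration, not the actual safety layer:

```typescript
// Sketch of a classical safety net around LLM inference. The timeout
// value and SAFE_STOP fallback are illustrative assumptions.
type Decision = { targetId: string | null; reason: string };

const SAFE_STOP: Decision = { targetId: null, reason: "safety fallback: stop in place" };

async function decideWithSafetyNet(
  infer: (prompt: string) => Promise<string>,
  prompt: string,
  timeoutMs = 100,
): Promise<Decision> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("inference timeout")), timeoutMs),
  );
  try {
    const raw = await Promise.race([infer(prompt), timeout]);
    const d = JSON.parse(raw) as Decision;
    if (typeof d.targetId !== "string") return SAFE_STOP;
    return d;
  } catch {
    // Any failure -- timeout, malformed JSON, thrown error -- never
    // reaches the motors; the robot simply stops.
    return SAFE_STOP;
  }
}
```

The motors only ever receive a `Decision`, and every failure path collapses to the same safe default.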

None of these ideas are unique to LLMos. But their specific combination – and the fact that they are implemented, tested, and validated end-to-end in a single TypeScript codebase – is what makes this system worth studying.


A Note on Karpathy’s LLM OS

Andrej Karpathy has articulated a vision of the “LLM OS” where the language model serves as the CPU of a new kind of operating system, with tools as peripherals and context windows as working memory. LLMos is an implementation of a subset of this vision, focused specifically on physical agents. The key difference is that LLMos does not treat the physical world as just another tool call. It builds a complete spatial reasoning stack – world models, path planners, sensor fusion – that gives the LLM a structured understanding of physical space, rather than leaving it to reason about coordinates from raw text.

The LLM OS vision is broad. LLMos is narrow and deep. It takes one domain – robot navigation – and builds the full stack from philosophy to motor commands.


Summary

The thesis of LLMos is that large language models should not be accessories to traditional robotics systems. They should be the kernel. The traditional pipeline (human thought compressed into compiled binaries) is replaced by an inverted pipeline where the LLM reasons natively through structured frames and the deterministic stack serves as its execution engine.

This requires two levels of inversion: development-time (markdown as the native program format) and runtime (JSON execution frames as LLM bytecode). It requires a clear boundary between what the LLM decides and what classical systems execute. And it requires an operating system metaphor that maps every traditional OS concept to an LLM-native equivalent.

The rest of this book shows how it works in practice – from the 50x50 occupancy grid through the 13-step navigation loop to the V1 Stepper Cube Robot hardware that brings the thesis off the screen and onto a desk.


Next: Chapter 2 – Two Brains: Development and Runtime