sysadmin-chronicles/docs/ARCHITECTURE.md

# SYSADMIN CHRONICLES — ARCHITECTURE DOCUMENT
> Version 5.0 | Status: Active development
>
> Changelog:
>   v5.0 — GDScript/Godot codebase removed. Node.js + Svelte is the only codebase.
>   v4.0 — Full architecture pivot to Node.js game server + Svelte web HUD.
>   v3.x — Save system, world flags, trust, incidents, pressure system (GDScript era).
>   v2.0 — Native Godot 4 + libvirt design (superseded).
>   v1.0 — Browser/v86 prototype (superseded).

---

## 1. PROJECT OVERVIEW

**Sysadmin Chronicles** is a native Linux-only game where the player works as a
junior sysadmin at Axiom Works, handling tickets inside **real Linux virtual
machines** managed by **QEMU/KVM via libvirt**.

The runtime stack (introduced in v4.0; the only codebase since v5.0):
- **Game server** — Node.js / Express + WebSocket (`server/`). Owns all game
  logic: quest state, trust, validation, VM lifecycle, incidents, save state.
- **Web HUD** — Svelte single-page app (`frontend/`). Tickets, mail, Sage, docs,
  trust bar. Served from the game server at `http://192.168.100.1:3000`.
- **Workstation VM** — XFCE desktop (Debian 12, sc-workstation). Player's desk.
  Chromium auto-opens the HUD. Tilix provides a real terminal for SSH to target VMs.
- **Target VMs** — Headless Debian (hermes) and Arch (vulcan). Quest objectives
  live here. Player investigates and fixes via SSH from the workstation terminal.

The player experience:
- Sits at the workstation VM (via SPICE/remote-viewer fullscreen on the host)
- Reads tickets and mail in the Chromium HUD
- Opens Tilix, SSHes to hermes or vulcan, fixes real problems
- Clicks "Mark Complete" in the HUD — game server SSHes in and validates VM state
- World reacts, trust shifts, new mail arrives via WebSocket push

No simulated terminal. No fake SSH sessions.

---

## 2. CORE DESIGN PRINCIPLES

- Realism over simulation
- Native Linux execution only
- CLI-first development and asset wiring
- Minimal, stable HUD components; behavior lives in the game server
- Data-driven content for quests, tickets, incidents, and dialogue
- State-based validation only; never command-sequence checking
- Multiple valid solutions where possible
- Pressure comes from evolving systems, not arbitrary timers
- Progression unlocks access, tools, and scope, not RPG stats
- Deterministic systems so content is testable and agent-friendly
- The dirty VM state is the game — preserve it, do not erase it

---

## 3. HIGH-LEVEL ARCHITECTURE

```
HOST MACHINE
├── server/               Node.js/Express + WebSocket  (server/src/)
│   ├── ContentLoader     loads content/ JSON at startup
│   ├── QuestEngine       quest state machine
│   ├── TicketService     ticket state, mark-complete handler
│   ├── ValidationEngine  SSH into VMs, evaluates rules
│   ├── VMManager         virsh start/stop/snapshot wrappers
│   ├── TrustSystem       score, unlock evaluation, revocation
│   ├── ProgressionSystem unlocked docs, VMs, access
│   ├── EmailService      inbox, follow-up emails, reply options
│   ├── SageService       rule-based knowledge base / dialogue
│   ├── ShiftTimer        shift clock, pressure tick schedule
│   ├── IncidentScheduler incident injection
│   └── SaveState         ~/.local/share/sysadmin-chronicles/save.json
│
├── frontend/             Svelte web HUD  (frontend/src/)
│   ├── TicketsPanel      ticket list, detail, "Mark Complete" button
│   ├── MailPanel         inbox, message view, reply buttons
│   ├── DocsPanel         trust-gated internal docs
│   ├── SagePanel         chat / knowledge base search
│   └── HeaderBar         trust indicator, shift timer, unread count
│
└── content/              JSON content — quests, tickets, dialogue, etc.

NETWORK: sc-internal (libvirt bridge 192.168.100.0/24)
  192.168.100.1  host  (game server port 3000)

VMs on sc-internal
├── sc-workstation (ares)   Debian 12 XFCE — player's desk
│   ├── Chromium → http://192.168.100.1:3000  (HUD, always open)
│   └── Tilix → SSH to hermes/vulcan          (real terminal)
├── sc-web-server (hermes)  headless Debian   (Q002–Q005, Q007)
└── sc-build-machine (vulcan) headless Arch   (Q006, Q008)

PLAYER FLOW:
  Host starts game server → boots sc-workstation (virsh) → opens it via SPICE
  Player sees XFCE desktop → Chromium with HUD auto-open
  Reads ticket → opens Tilix → SSH hermes → fixes problem
  Clicks "Mark Complete" → server SSHes hermes → validates
  Trust updates → WebSocket pushes to browser → new mail arrives
```

---

## 4. RUNTIME MODEL

### 4.1 Game Server — Node.js

The game server (`server/src/index.js`) is a Node.js/Express application:
- Serves `frontend/dist/` as static files at `/`
- WebSocket server on the same port (real-time event push to HUD)
- On startup: loads all content JSON, hydrates services from save file,
  ensures workstation VM is live via VMManager

The server is responsible for:
- All game logic (quest state, trust, progression, incidents)
- VM lifecycle management (virsh via child_process)
- Validation — SSH into target VMs and evaluate rules
- Save/load (single JSON file at `~/.local/share/sysadmin-chronicles/save.json`)
- WebSocket broadcast of trust changes, new mail, shift ticks, incident alerts
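
The broadcast responsibility could look roughly like the sketch below. It assumes a hypothetical `{ type, payload }` JSON envelope; only the `shift:tick` event name appears in this document, and `broadcast` is an illustrative helper rather than the server's confirmed API. Clients only need to look like `ws` sockets (a `readyState` and a `send` method).

```javascript
// Hypothetical broadcast helper; the { type, payload } envelope is an
// assumption, not the project's confirmed wire format.
const WS_OPEN = 1; // readyState value of an open WebSocket

function broadcast(clients, type, payload) {
  const message = JSON.stringify({ type, payload });
  let delivered = 0;
  for (const client of clients) {
    // Skip sockets that are still connecting, closing, or closed.
    if (client.readyState !== WS_OPEN) continue;
    client.send(message);
    delivered += 1;
  }
  return delivered; // number of clients that received the event
}
```

With the `ws` package, the server's `clients` set can be passed in directly as the first argument.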

### 4.2 Frontend — Svelte

The web HUD (`frontend/src/`) is a Svelte single-page app:
- Built with Vite; output lands in `frontend/dist/` and is served by the game server
- All data fetched from the game server API; no local state beyond UI
- WebSocket client for real-time updates
- Does not run validation — only displays results

### 4.3 Target Platform

- Host OS: Linux
- Supported deployment model: start game server on host, view workstation via SPICE
- Required host: KVM, libvirt, virsh, Node.js 18+, virt-viewer
- Required install model: one-time host setup with clean uninstall path

No Windows, macOS, or browser target is planned for the host. The HUD is a web
app served locally — it is never exposed to the internet.

---

## 5. VIRTUAL MACHINE SYSTEM

### 5.1 Required Stack

- `qemu-system-*`
- `KVM`
- `libvirtd`
- `virsh`
- libvirt virtual networks
- qcow2-backed VM images

Runtime policy:
- The shipped game should not require broad `sudo` usage during normal play
- One-time host setup may require admin approval
- Ongoing gameplay should run as a regular user against a prepared VM runtime

### 5.2 Core Behavior

The game controls VMs through libvirt, not by emulating them internally.

Responsibilities:
- Ensure required domains and networks exist
- Start the active VM
- Stop or suspend inactive VMs
- Revert to known snapshots for resets
- Query runtime state for evaluation
- Attach the player to the appropriate VM workflow
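
A minimal sketch of how these responsibilities might map onto `virsh` calls, assuming a thin wrapper around `child_process.execFile`. The function names and action labels are illustrative and not the actual `VMManager` or `lib/virsh.js` API; the `virsh` subcommands themselves are standard.

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Map each lifecycle action to the virsh argument list it needs.
// Action names are hypothetical labels chosen for this sketch.
function virshArgs(action, domain, snapshot) {
  const needsSnapshot = action === 'snapshot' || action === 'revert';
  if (needsSnapshot && !snapshot) {
    throw new Error(`${action} requires a snapshot name`);
  }
  switch (action) {
    case 'start':    return ['start', domain];
    case 'suspend':  return ['suspend', domain];
    case 'shutdown': return ['shutdown', domain];
    case 'state':    return ['domstate', domain];
    case 'snapshot': return ['snapshot-create-as', domain, snapshot];
    case 'revert':   return ['snapshot-revert', domain, snapshot];
    default:
      throw new Error(`unknown VM action: ${action}`);
  }
}

// Run virsh without a shell so domain and snapshot names are never
// interpreted as shell syntax. Resolves with trimmed stdout.
async function virsh(action, domain, snapshot) {
  const args = virshArgs(action, domain, snapshot);
  const { stdout } = await execFileAsync('virsh', args);
  return stdout.trim();
}
```

Keeping argument construction separate from execution lets the mapping be unit-tested on a machine without libvirt installed.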

The workstation and at least one target VM must be able to run at the same
time. This is required for real SSH-based play and for background incidents to
continue evolving while the player works elsewhere.

Operational guidance:
- `workstation` stays live during normal play
- At least one target VM stays live with it
- Later phases may keep all major quest VMs active simultaneously
- Resource budgets should be documented and enforced conservatively

Lab finding:
- Small headless target VMs were inexpensive on the test host
- The workstation became materially heavier once a real graphical session and
  browser were added
- Budget the workstation separately from server-style quest VMs

### 5.3 Initial VM Roles

| ID | Role | Distro | Hostname | Purpose |
|----|------|--------|----------|---------|
| `workstation` | Player desktop | Debian 12 | `ares` | XFCE + Chromium HUD + Tilix terminal |
| `web_server` | Service host | Debian 12 | `hermes` | Web/service quests (Q002–Q005, Q007) |
| `build_machine` | Build box | Arch | `vulcan` | Package/build/update quests (Q006, Q008) |

#### 5.3.1 Workstation Profile

The workstation is a full XFCE desktop (Debian 12, 768–1536 MB RAM):
- **Chromium** — opens `http://192.168.100.1:3000` on login (game HUD)
- **Tilix** — split-pane terminal, set as default; player SSHes to hermes/vulcan from here
- **Full sysadmin CLI toolkit** pre-installed (vim, htop, tmux, curl, nmap, tcpdump, etc.)
- SPICE display with QXL video — dynamic resolution via vdagent; fullscreen via `remote-viewer`
- `always_live: true` — stays running between shifts; suspended on game quit, resumed on next launch

The player never needs to touch the workstation VM's own filesystem for game
objectives; all quest work happens on the target VMs via SSH.

#### 5.3.2 Why XFCE + Chromium (not terminal-only)

Earlier iterations used a terminal-only workstation. The game was redesigned
because a terminal-only approach would require building a fake terminal and fake SSH.
The XFCE + real browser approach is simpler, more realistic, and requires no
terminal simulation at all:

- Player uses a real Tilix terminal — no simulation
- Player SSHes with real SSH — no protocol emulation
- The HUD is a real web app — no custom UI framework needed for game chrome
- Downside: the workstation VM is the most expensive guest to run (see 5.3.1 for its RAM range); budget accordingly

### 5.4 Snapshot Strategy

Snapshots are the reset primitive and the save primitive.

Named snapshot tiers per VM:

| Name | Purpose |
|------|---------|
| `baseline.clean` | Authored starting state for a fresh quest arc |
| `baseline.recovery` | Fallback if live state is unrecoverable |
| `checkpoint.shift-{N}` | Auto-saved at start of each in-game shift |

Rules:
- Snapshot names are deterministic
- Quest definitions declare their required baseline snapshot (`baseline_snapshot`)
- Validation never depends on snapshot history; only current observed state
- The game retains a maximum of 5 shift checkpoints per VM; older ones are pruned
- `baseline.clean` and `baseline.recovery` are never pruned by the game
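
The retention rule can be sketched as a pure function over snapshot names. This is an illustration only: `checkpointsToPrune` is a hypothetical helper, and it assumes the snapshot list for one VM has already been fetched as an array of strings.

```javascript
// Matches the checkpoint.shift-{N} naming scheme and captures N.
const CHECKPOINT_RE = /^checkpoint\.shift-(\d+)$/;

// Return the checkpoint names that exceed the retention limit,
// oldest last. Hypothetical helper, not the project's actual code.
function checkpointsToPrune(snapshotNames, maxCheckpoints = 5) {
  const checkpoints = snapshotNames
    .map((name) => {
      const match = CHECKPOINT_RE.exec(name);
      return match ? { name, shift: Number(match[1]) } : null;
    })
    // Anything that is not a shift checkpoint (baseline.clean,
    // baseline.recovery, unknown names) is never a pruning candidate.
    .filter((entry) => entry !== null)
    .sort((a, b) => b.shift - a.shift); // newest shift first

  return checkpoints.slice(maxCheckpoints).map((entry) => entry.name);
}
```

Sorting by the numeric shift rather than by name keeps `checkpoint.shift-10` newer than `checkpoint.shift-9`.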

### 5.5 Networking Model

Networking is host-controlled through libvirt.

Supported modes:
- `quest`: constrained, deterministic virtual networks and fixtures
- `sandbox`: broader connectivity for experimentation

Examples:
- Internal-only network between workstation and target VM
- Broken DNS as part of a quest
- Deliberately degraded service reachability
- Optional outbound package mirror access for selected scenarios

### 5.6 VM Provisioning Hooks

Quest-specific VM state — broken configs, missing files, log histories — is
authored into the VM baseline before the snapshot is taken. This is done via
idempotent provisioning scripts:

```
tools/vm/quest-prep/Q0XX-prep.sh
```

These scripts run against the target VM before the quest's `baseline.clean`
snapshot is taken. They are never run at quest activation time. See
QUEST_AUTHORING.md for the full provisioning workflow.

---

## 6. OBSERVATION AND VALIDATION

### 6.1 Validation Philosophy

Quest completion is based on **system state**, not on how the player got there.

Allowed evidence includes:
- Files and directory contents
- Ownership and permissions
- Service state
- Process state
- Open ports
- Package state
- Mount state
- Disk utilization
- System configuration values

Disallowed as primary success conditions:
- Specific commands typed
- Specific files opened
- UI click history

### 6.2 Observation Sources

Primary sources:
- `virsh domstate`, `virsh domifaddr`, and domain metadata
- Host-driven inspection tooling such as libguestfs where practical
- SSH-based read-only checks initiated by the host when needed
- Quest-specific host probe scripts for higher-level state summaries

Authoritative rule:
- Quest validation must use host-authoritative checks only
- In-guest helpers may improve responsiveness, but cannot decide success

In-guest helpers should use neutral names (examples: `atlas-index`, `yardd`,
`ops-telemetry-cache`) and must not be trusted as a security boundary.

Operational note:
- Routine package operations inside guests may emit maintenance or virtualization
  notices that break immersion
- Base images should suppress or tune guest maintenance messaging where safe
  for the authored environment
- Validation and incident design should not rely on noisy package-manager side
  effects being visible to the player

### 6.3 Validation Rule Model

Core rule families:
- `file_exists` / `file_contains` / `file_mode` / `file_owner`
- `directory_exists`
- `service_state` / `service_enabled`
- `process_running` / `process_user`
- `port_listening`
- `package_installed`
- `mount_present`
- `disk_usage_below` / `disk_usage_above`
- `command_assert` — fallback only, must verify state not behavior
- `and` / `or` / `not`
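
A minimal sketch of how the combinators might compose leaf rules. The rule shape (`type`, `rules`, `rule`) and the injected `probes` map are assumptions made for illustration; in the real engine each probe would be a host-driven check against the target VM.

```javascript
// Hypothetical rule shape: { type, ...params }, with `rules` (array) for
// and/or and `rule` (single rule) for not. Leaf checks are injected.
async function evaluateRule(rule, probes) {
  switch (rule.type) {
    case 'and':
      for (const child of rule.rules) {
        if (!(await evaluateRule(child, probes))) return false;
      }
      return true;
    case 'or':
      for (const child of rule.rules) {
        if (await evaluateRule(child, probes)) return true;
      }
      return false;
    case 'not':
      return !(await evaluateRule(rule.rule, probes));
    default: {
      const probe = probes[rule.type];
      if (!probe) throw new Error(`no probe for rule type: ${rule.type}`);
      // A probe inspects current system state and resolves to a boolean.
      return Boolean(await probe(rule));
    }
  }
}
```

Because probes only observe state, the same rule tree passes regardless of which commands the player ran to reach that state.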

### 6.4 Trust Boundary

The player may gain root access on some machines. The guest is not trusted. The
host validation layer is trusted. Anti-cheat is achieved through external
validation, not secrecy.

---

## 7. GAMEPLAY SYSTEMS

### 7.1 Core Loop

1. Ticket arrives with incomplete context
2. Player evaluates urgency against other active problems
3. Player connects to the relevant VM over SSH
4. Player investigates using real Linux tools
5. Player applies a fix
6. Game validates resulting state
7. World reacts
8. Trust shifts
9. Future conditions reflect earlier choices

### 7.2 System Pressure

Pressure is systemic, not a countdown bar. Examples:
- Disk usage keeps climbing
- A log fills with worsening symptoms
- A degraded service starts affecting another team
- A quick fix suppresses one symptom while creating later instability

Pressure is authored as state transitions and event chains via incident files.

### 7.3 Trust / Reputation

Trust measures how much the organization relies on the player.

Trust affects:
- sudo scope
- accessible machines
- diagnostic tooling
- ticket sensitivity
- documentation visibility

**Trust increases** when the player resolves problems cleanly, finds root causes,
and avoids collateral damage.

**Trust decreases** when the player breaks unrelated systems, applies fragile
fixes, ignores urgent incidents, or resolves symptoms but not causes.

**Trust revocation**: if trust falls below a declared threshold in the trust
unlock table, specific access strings are revoked. A subsequent trust increase
does not automatically restore revoked access — the player must re-earn the
unlock tier. Revocation rules must be explicitly declared per unlock tier.
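
One way to read the revocation rule is as a gap between two thresholds. The sketch below assumes a hypothetical tier shape `{ access, unlock_at, revoke_below }` and interprets "re-earn" as reaching `unlock_at` again; both the field names and that interpretation are assumptions, not the confirmed unlock table format.

```javascript
// Hypothetical tier shape: { access, unlock_at, revoke_below }.
// Returns the new set of granted access strings for a trust score.
function updateAccess(tiers, granted, trust) {
  const next = new Set(granted);
  for (const tier of tiers) {
    if (trust >= tier.unlock_at) {
      next.add(tier.access); // tier earned, or re-earned after revocation
    } else if (tier.revoke_below !== undefined && trust < tier.revoke_below) {
      next.delete(tier.access); // explicit revocation rule fired
    }
    // Between revoke_below and unlock_at nothing changes: access that
    // was held is kept, and access that was revoked stays revoked.
  }
  return next;
}
```

A tier without `revoke_below` is never revoked in this sketch, which matches the requirement that revocation be declared explicitly per tier.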

### 7.4 Multiple Valid Solutions

Quests support realistic alternatives where possible:
- quick workaround
- operationally acceptable fix
- proper long-term fix

Branch resolution rule:
- multiple branches may match the same final state
- each branch must declare a numeric `priority`
- the highest matching priority wins
- ties are a content error and fail validation during authoring checks
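
The resolution rule can be sketched as follows. It assumes each branch object carries an `id` (a hypothetical field) alongside the required numeric `priority`, and that the caller passes only the branches whose validation rules matched.

```javascript
// Pick the winning branch among those whose rules matched the final state.
// Returns null when nothing matched; throws on a priority tie.
function resolveBranch(matchingBranches) {
  if (matchingBranches.length === 0) return null;
  const top = Math.max(...matchingBranches.map((b) => b.priority));
  const winners = matchingBranches.filter((b) => b.priority === top);
  if (winners.length > 1) {
    const ids = winners.map((b) => b.id).join(', ');
    throw new Error(`priority tie between branches: ${ids}`);
  }
  return winners[0];
}
```

Authoring checks should catch ties before content ships, so the runtime throw here is only a last line of defense.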

### 7.5 Dynamic Events

Dynamic events inject prioritization pressure and are authored in incident files.
Events are selected from authored pools and activated by progression, trust,
current system state, and world flags.

Each incident declares a `blast_radius_quests` list so the incident scheduler
can avoid activating an incident that would corrupt active quest evidence or
simultaneously interfere with an in-progress objective.
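
The blast-radius check amounts to a set intersection. A minimal sketch, assuming incidents are plain objects with an `id` (hypothetical) and the `blast_radius_quests` list described above:

```javascript
// Keep only incidents whose blast radius does not overlap an active quest.
// Incidents without a declared list are treated here as having an empty
// radius; that default is an assumption of this sketch.
function eligibleIncidents(incidents, activeQuestIds) {
  const active = new Set(activeQuestIds);
  return incidents.filter((incident) => {
    const radius = incident.blast_radius_quests || [];
    return !radius.some((questId) => active.has(questId));
  });
}
```

The scheduler would then select from the filtered pool using progression, trust, system state, and world flags.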

### 7.6 Investigation Quality

Clues must be legible and grounded. Every quest declares a `clue_fingerprint`
documenting what evidence exists in the VM baseline. Content validation checks
that the fingerprint is plausible. The player should feel rewarded for competent
debugging rather than guessing.

### 7.7 Progression

Progression unlocks:
- broader sudo access
- new servers
- more dangerous responsibilities
- better internal docs
- helper scripts and diagnostics

This is institutional progression, not character stats.

### 7.8 Mentor Thread

Marcus is the primary mentor character. His dialogue runs across the full game
as a `series_id: marcus-main` thread. Each dialogue file that belongs to an
ongoing character relationship declares `series_id` and `series_position`.

The dialogue system tracks series state so Marcus remembers what happened in
earlier quests and can reference it in later ones. This is the primary vehicle
for institutional memory and character continuity.
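
Series tracking could be as simple as ordering by `series_position` and remembering what has been delivered. The sketch below assumes each dialogue has an `id` field and that entries in a series are delivered strictly in position order; both are assumptions, and only `series_id` and `series_position` come from this document.

```javascript
// Return the next undelivered dialogue in a series, or null when the
// series is exhausted. `deliveredIds` is the tracked series state.
function nextInSeries(dialogues, seriesId, deliveredIds) {
  const delivered = new Set(deliveredIds);
  const remaining = dialogues
    .filter((d) => d.series_id === seriesId && !delivered.has(d.id))
    .sort((a, b) => a.series_position - b.series_position);
  return remaining.length > 0 ? remaining[0] : null;
}
```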

### 7.9 Tone and Humor

The tone is dry, realistic, and slightly dysfunctional. Examples:
- contradictory runbooks
- tickets that misidentify the problem
- passive-aggressive internal notes
- perfect urgency attached to trivial formatting requests

Humor must support immersion, not break it.

---

## 8. COMMAND AND ACCESS MODEL

Access is controlled realistically through:
- user accounts and group membership
- sudoers configuration
- reachable hosts
- available packages and tooling

If a player cannot run `systemctl`, the reason is that the VM account lacks the
required privileges, not that the game disabled the verb.

---

## 9. PRESENTATION LAYER

The player's view is the workstation VM desktop, viewed fullscreen via SPICE:

```bash
scripts/start-game.sh
# → starts game server
# → virsh start sc-workstation (if not already running)
# → remote-viewer --full-screen spice://127.0.0.1:<port>
```

The player sees an XFCE desktop with Chromium pre-opened to the HUD.

### 9.1 VM Display

- **Protocol**: SPICE with QXL video driver
- **Client**: `remote-viewer` (from `virt-viewer` package) in fullscreen mode
- **Resolution**: dynamic — guest vdagent resizes to match host display
- **Cursor release**: `Ctrl+Alt`; fullscreen toggle: `F11`
- **Clipboard sharing**: via spice-vdagent in the guest

No VNC, no custom viewer widget. The host runs `remote-viewer` and the player
works inside the workstation VM.

### 9.2 HUD (Svelte Web App)

The game HUD is a Svelte single-page app served at `http://192.168.100.1:3000`:

- **TicketsPanel** — ticket list, detail view, "Mark Complete" button
- **MailPanel** — inbox, message body, reply buttons (where applicable)
- **DocsPanel** — trust-gated internal docs, rendered from content/docs/
- **SagePanel** — chat interface to SageService knowledge base
- **HeaderBar** — trust indicator (no number, behavior only), shift timer, unread badge

The HUD is a company intranet portal in look and feel — dark, monospace, minimal.

### 9.3 One-Time Setup and Uninstall

Host-side setup is unavoidable (KVM, libvirt, VM images). It must be simple.

Principles:
- one-time setup only (`tools/setup/first-run-setup.sh`)
- plain-language explanation of what will be installed
- managed resources use the `sc-` prefix (never touch other libvirt domains)
- full uninstall removes all game-owned domains, networks, storage, helper files
- normal gameplay does not require broad `sudo`

---

## 10. DATA MODEL

Authoring formats:
- JSON for quests, tickets, incidents, dialogue, documentation metadata
- Shell helper scripts where CLI integration is necessary

Top-level content domains:

| Domain | Purpose |
|--------|---------|
| `quests/` | Objective chains and validation rules |
| `tickets/` | Player-facing problem statements |
| `incidents/` | Dynamic system pressure events |
| `dialogue/` | Workplace messages, hints, follow-ups |
| `docs/` | Internal documentation metadata/content |
| `progression/` | Trust thresholds, unlocks, access tiers |
| `vm_profiles/` | Domain names, snapshots, networks, probe config |
| `helpers/` | Non-obvious guest helper naming/config data |
| `world_flags/` | Central registry of all world state flags |

Each authored scenario must declare:
- `required_vms` — all VMs the quest touches
- `baseline_snapshot` — starting snapshot for this quest
- `clue_fingerprint` — evidence declared in the VM baseline
- validation rules and branch priorities
- escalation behavior
- trust impact
- `blast_radius` — incident IDs the quest may interact with
- follow-on world effects
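
An authoring check for the named fields might look like the sketch below. It covers only the keys this document spells out (`required_vms`, `baseline_snapshot`, `clue_fingerprint`, `blast_radius`); the value types are assumptions, and the remaining requirements are omitted because their key names are not given here.

```javascript
// Return the names of required quest fields that are missing or malformed.
// Assumed types: required_vms and blast_radius are arrays,
// baseline_snapshot is a non-empty string.
function missingQuestFields(quest) {
  const problems = [];
  if (!Array.isArray(quest.required_vms) || quest.required_vms.length === 0) {
    problems.push('required_vms');
  }
  if (typeof quest.baseline_snapshot !== 'string' || quest.baseline_snapshot === '') {
    problems.push('baseline_snapshot');
  }
  if (quest.clue_fingerprint === undefined || quest.clue_fingerprint === null) {
    problems.push('clue_fingerprint');
  }
  if (!Array.isArray(quest.blast_radius)) {
    problems.push('blast_radius');
  }
  return problems;
}
```

An empty result means the quest declares every field checked here; it says nothing about whether the values are plausible.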

---

## 11. SAVE MODEL

### 11.1 Dirty State Model

The game uses a **dirty state model**. VM disk state is preserved across
sessions as-is. The game does not revert to a clean baseline on load — it
resumes from whatever state the VMs are currently in.

This is intentional. The player's history of changes is part of the game. A
machine they fixed stays fixed. A machine they damaged stays damaged until they
repair it or request a reimage.

Two persistence layers:

**Game State Layer** — saved as JSON:
- Trust score and history
- Unlocked access, sudo scopes, docs, tools
- Active/completed quest and ticket state
- World flags (current values and change history)
- Incident scheduler state
- In-world clock and shift counter

**VM State Layer** — saved as libvirt snapshot references:
- Per-VM reference to current snapshot tier or live disk
- Per-VM managed recovery checkpoint list
- Reimage history per VM

### 11.2 Shift Checkpoints

At the start of each in-game shift:
1. Game state JSON is saved
2. A named snapshot is created per active VM: `checkpoint.shift-{N}`
3. The checkpoint reference is recorded in the save file
4. Shift checkpoints beyond the retention limit (default: 5) are pruned

Shift checkpoint rollback is an explicit player action ("start this shift
over") with a confirmation prompt. It does not undo trust changes or dialogue
already delivered.

### 11.3 Load-Time Reconciliation

On load, the observation service validates current VM state against saved world
flags. Minor drift is handled silently. Major drift — missing snapshots,
unbootable VMs — triggers the recovery flow.

If a referenced snapshot is missing:
- If `baseline.recovery` exists, offer resume from recovery
- If `baseline.recovery` is also gone, the VM is treated as unrecoverable
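
The missing-snapshot decision can be sketched as a small pure function. The return labels (`resume`, `offer_recovery`, `unrecoverable`) are illustrative names, not confirmed identifiers from the codebase.

```javascript
// Decide how to proceed for one VM given the snapshot the save file
// references and the snapshots that actually exist on the host.
function reconcileSnapshot(referenced, existingSnapshots) {
  const existing = new Set(existingSnapshots);
  if (existing.has(referenced)) return 'resume';
  if (existing.has('baseline.recovery')) return 'offer_recovery';
  return 'unrecoverable';
}
```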

### 11.4 Recovery / Reimage Flow

When a VM is unrecoverable, the player can report it for reimage through an
in-world mechanic:

1. Player submits a reimage request (ticket to management)
2. In-world delay is imposed (one in-game shift)
3. Machine is restored from `baseline.recovery` or `baseline.clean`
4. Trust penalty is applied based on severity
5. In-progress quests on that VM are reset
6. Evidence from before the reimage is gone — acknowledged in-world

This is the designed escape valve. It has visible consequences but allows
forward progress.

### 11.5 Host Storage Management

qcow2 images with many snapshots can balloon. The game enforces:
- Maximum of 5 shift checkpoints per VM (configurable in vm_profile)
- Authored baseline and recovery snapshots are never pruned by the game
- `resource_budget` in vm_profile declares expected disk footprint

### 11.6 Developer Reset

Not available in the shipped game. CLI only:

```bash
bash tools/vm/snapshot-all.sh --revert-to baseline.clean
```

Completely resets all VMs to their authored baseline. Used during content
authoring and automated test runs.

---

## 12. MODULE BREAKDOWN

### Server (`server/src/`)

| Module | Responsibility |
|--------|----------------|
| `index.js` | Express + WebSocket entry point; service wiring; static file serving |
| `ContentLoader` | Loads all content/ JSON at startup; never writes |
| `QuestEngine` | Quest state machine (pending → active → resolved) |
| `TicketService` | Ticket state, mark-complete handler, branch resolution |
| `ValidationEngine` | SSH into VMs, evaluates all rule types against real state |
| `VMManager` | virsh start/stop/snapshot/getIP wrappers |
| `TrustSystem` | Score tracking, unlock evaluation, revocation |
| `ProgressionSystem` | Unlocked docs, VMs, access strings |
| `EmailService` | Inbox, follow-up emails, reply options, WebSocket push |
| `SageService` | Rule-based dialogue / knowledge base |
| `ShiftTimer` | Shift clock, broadcasts shift:tick via WebSocket |
| `IncidentScheduler` | Pressure tick loop, incident injection |
| `ShiftReviewService` | End-of-shift performance review email generation |
| `CertificationService` | Awards internal certs after quest chain completion |
| `SaveState` | Read/write `~/.local/share/sysadmin-chronicles/save.json` |
| `lib/ssh.js` | Promisified SSH command execution (node-ssh) |
| `lib/virsh.js` | virsh command wrappers |
| `lib/eventBus.js` | Internal Node.js EventEmitter for service coordination |

### Frontend (`frontend/src/`)

| Component | Responsibility |
|-----------|----------------|
| `App.svelte` | Root component; WebSocket connection; panel routing |
| `TicketsPanel` | Ticket list, detail, mark-complete flow |
| `MailPanel` | Inbox, message body, reply buttons |
| `DocsPanel` | Trust-gated doc list and content viewer |
| `SagePanel` | Chat interface, follow-up prompts |
| `VmsPanel` | Live VM status indicators |
| `HeaderBar` | Trust display, shift timer, mail unread count |
| `lib/api.js` | Fetch wrapper for all REST API calls |

---

## 13. SECURITY AND SAFETY

Requirements:
- Scope libvirt resources to dedicated game domains/networks/storage pools
- Never operate on arbitrary host VMs by default
- Use explicit naming/prefixing for all game-managed resources (`sc-` prefix)
- Separate quest-mode constrained networks from broader sandbox networks
- Prefer least-privilege host integration
- Provide a dry-run and diagnostic mode for development scripts

The game manages only the resources it created or was explicitly pointed at
during setup.
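
The prefix rule lends itself to a guard that runs before any libvirt call. A minimal sketch, assuming a hypothetical `assertManaged` helper; resources explicitly registered during setup would need an additional allow-list that is not shown here.

```javascript
const MANAGED_PREFIX = 'sc-';

// Refuse to touch any libvirt resource that is not game-owned.
// Returns the name unchanged so the call can be used inline.
function assertManaged(name) {
  const ok =
    typeof name === 'string' &&
    name.startsWith(MANAGED_PREFIX) &&
    name.length > MANAGED_PREFIX.length;
  if (!ok) {
    throw new Error(`refusing to operate on unmanaged resource: ${String(name)}`);
  }
  return name;
}
```

Placing the guard inside the lowest-level wrapper means no higher-level service can bypass it by accident.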

---

## 14. TECHNOLOGY DECISIONS

| Technology | Role | Reason |
|-----------|------|--------|
| Node.js / Express | Game server | Async I/O, virsh via child_process, SSH via node-ssh, easy JSON |
| Svelte / Vite | Web HUD | Lightweight, no virtual DOM overhead, fast build |
| WebSocket (`ws`) | Real-time push | Trust changes, mail, incidents without polling |
| QEMU/KVM | Virtualization backend | Real Linux environments |
| libvirt / virsh | VM lifecycle control | Standard Linux automation surface |
| SPICE + QXL | Workstation display | Dynamic resolution, clipboard sharing, fullscreen |
| `remote-viewer` | Host-side SPICE client | Ships with virt-viewer; fullscreen with F11 |
| JSON | Content authoring | Data-driven, easy to diff, unchanged from prior design |
| node-ssh | SSH execution in validation | Clean Promise API; non-interactive, key-based auth |

Not in scope: v86, WebAssembly, browser-only runtime, service-worker networking.

---

## 15. DEVELOPMENT PRIORITIES

1. Native architecture consistency
2. VM control integration
3. Observation and validation
4. Core gameplay loop
5. Pressure, trust, and dynamic event systems
6. Presentation polish

If a design choice improves presentation but weakens VM realism or maintainable
automation, reject it.