44r0n7 0265afa054 chore: bootstrap lean sysadmin-chronicles repo
Import the runnable game code, content, docs, scripts, and repo guidance while leaving local agent state, dependency installs, build output, and backup copies out of the published tree.
2026-05-02 11:49:07 -04:00

SYSADMIN CHRONICLES — SAVE SYSTEM DESIGN

Version 1.3 | Status: Active development

Changelog: v1.3 — Defined persists: false flag semantics (shift boundary reset). Added world flag persistence rules section.

This document covers the save model, VM persistence policy, dirty state handling, recovery flows, and the design decisions behind them.


THE CORE TENSION

The game wants real VMs. Real VMs have real state. That state changes as the player works. The question is: what do we save, when, and what happens when things go wrong?

Two broad approaches exist:

Approach A: Replay Model. Save authored flags and game state only. On load, restore a baseline snapshot and replay authored events to reconstruct the world. Simple, cheap, predictable.

Approach B: Dirty State Model. Preserve actual VM disk state as-is. Save references to the current snapshot or live qcow2 state. On load, the VM resumes exactly where it was.

This game uses Approach B, with structured recovery fallbacks. Here is why, and what that means in practice.


WHY DIRTY STATE

The replay model breaks the design contract. If the player spent forty minutes debugging a broken service, leaving behind log entries, partial edits, and useful breadcrumbs, restoring a clean baseline erases all of that. The world forgets. That is not how real systems work.

The dirty state model means:

  • The player's workstation remembers what they did
  • Target VMs remember fixes applied and mistakes made
  • Evidence persists — good and bad
  • A machine the player damaged stays damaged until they fix it or request reimage
  • A machine they set up correctly stays correct

Operational note:

  • The workstation should be treated as a curated terminal-first appliance image whose shell history, local config, and jump-box state persist like any other VM state
  • Desktop-like company tools live in the game state layer, not inside a VM browser session
  • Rebuilding the workstation runtime on every reset would create slow, noisy, and inconsistent recovery behavior

This is more expensive. It is also the point of the game.


WHAT GETS SAVED

Game State Layer

Saved as structured JSON. Cheap, fast, always consistent.

  • Player trust score and history
  • Unlocked VMs, sudo scopes, internal docs, tools
  • Active and completed ticket/quest state
  • World flags (current values and change history)
  • Incident scheduler state (active incidents, escalation timers)
  • Per-quest authored consequence records
  • Shift timestamp and in-world clock

VM State Layer

Saved as libvirt snapshot references or qcow2 state references. Expensive but necessary.

  • Per-VM: reference to current named snapshot or live disk state
  • Per-VM: list of managed recovery checkpoints
  • Per-VM: reimage eligibility and reimage history
  • Per-VM: last-known observation data (advisory, not authoritative)

The game does not store VM disk images in the save file. It stores references to named snapshots managed by libvirt. The actual disk data lives where libvirt puts it.


WORLD FLAG PERSISTENCE RULES

Every world flag in world_flags/world_flags.json declares a persists field. This controls how the flag behaves across shift boundaries and game loads.

persists: true

The flag is written to the save file and survives indefinitely. It is cleared only when a quest or incident explicitly sets it to false, or when the VM is reimaged. Most flags are persistent — they represent stable facts about the world (nginx is configured correctly, logrotate is healthy, etc.).

persists: false

The flag is reset at the start of each new shift, regardless of its current value. It is NOT reset on game load within the same shift.

Non-persistent flags represent transient pressure states that should not carry forward into the next working session:

  • hermes_disk_healthy — disk state that may change overnight without the player's intervention
  • web_disk_pressure_active — active disk pressure event currently escalating

On shift boundary: all persists: false flags are cleared before the new shift's checkpoint is taken. Their cleared state is what gets saved.

On game load mid-shift: persists: false flags are loaded from the save file as-is. They are not reset on load, only on shift boundary.

Implementation note for SaveSystem: When writing the shift checkpoint, iterate all world flags and zero out any with persists: false before serializing. Do not zero them in the live WorldFlagRegistry until the checkpoint write is complete, to avoid mid-write state corruption.
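
A minimal Python sketch of the ordering in the implementation note above. It assumes flags are booleans (so "zero out" means set to False) and that flag definitions are available as a dict keyed by flag name. The function name write_shift_checkpoint and the temp-file-then-rename step are illustrative choices, not part of the SaveSystem design.

```python
import json
import os
import tempfile


def write_shift_checkpoint(live_flags, flag_defs, path):
    """Write a checkpoint with non-persistent flags zeroed, then clear them live.

    live_flags: dict of flag name -> bool (stands in for WorldFlagRegistry)
    flag_defs:  dict of flag name -> {"persists": bool}
    """
    # Zero transient flags in a copy so the live registry is untouched
    # if the write fails part-way through.
    snapshot = dict(live_flags)
    transient = [
        name for name, spec in flag_defs.items()
        if not spec.get("persists", True)
    ]
    for name in transient:
        if name in snapshot:
            snapshot[name] = False

    # Write to a temp file, then rename, so a crash cannot leave a
    # half-written save at the final path.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump({"world": {"flags": snapshot}}, handle, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

    # The write is complete; only now clear the live values.
    for name in transient:
        if name in live_flags:
            live_flags[name] = False
    return snapshot
```

If the write raises, the live flags keep their pre-checkpoint values, which is the property the note asks for.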


SNAPSHOT STRATEGY FOR SAVE/LOAD

Named Snapshot Tiers

Each VM maintains three tiers of named snapshots, plus the live disk, which is not a snapshot:

baseline.clean          — Authored starting state for a fresh quest arc
baseline.recovery       — Fallback if live state is unrecoverable
checkpoint.shift-{N}    — Auto-saved at start of each in-game shift
live                    — Current working state (no snapshot, just disk)

On save: the game records which snapshot tier is current per VM and any divergence from it (live state is implicitly the disk, not a snapshot).

On load: the game checks that referenced snapshots still exist and are consistent with the saved game state flags. If they are, it resumes from live disk state and continues normally.

What "Resume" Means

The game does not revert to a snapshot on load. It resumes from whatever state the VMs are currently in. The save file describes what the game thinks the world looks like. On load, the observation service validates current VM state against saved world flags and reconciles any drift.

Minor drift (service restarted, log rotated by the OS) is handled silently. Major drift (a VM that should be running is gone, a snapshot reference is missing) triggers the recovery flow.
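
The minor/major split can be sketched as a small classifier. This is a rough Python illustration only: the Drift enum, the classify_drift name, and the idea of comparing a fresh observation against last_observation are assumptions, and the per-VM field names are borrowed from the draft save schema.

```python
from enum import Enum


class Drift(Enum):
    NONE = "none"
    MINOR = "minor"   # handled silently
    MAJOR = "major"   # triggers the recovery flow


def classify_drift(saved_vm, domain_exists, snapshot_names, observation):
    """Classify drift for one VM on load.

    saved_vm:       the per-VM entry from the save file
    domain_exists:  whether the VM is still known to the hypervisor
    snapshot_names: snapshot names currently reported for this VM
    observation:    fresh observation data (hypothetical shape)
    """
    referenced = {saved_vm["last_checkpoint"], saved_vm["recovery_snapshot"]}
    # A missing VM or a missing referenced snapshot is major drift.
    if not domain_exists or not referenced <= set(snapshot_names):
        return Drift.MAJOR
    # Anything else that differs from the advisory observation is minor.
    if observation != saved_vm.get("last_observation", {}):
        return Drift.MINOR
    return Drift.NONE
```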


DIRTY STATE RISKS AND MITIGATIONS

Risk 1: Snapshot Reference Goes Stale

A named snapshot the game references is deleted or corrupted outside the game.

Mitigation: On load, the save system checks that all referenced snapshots exist before resuming. If a checkpoint snapshot is missing but baseline.clean exists, the game offers to resume from baseline, with authored-flag reconstruction where possible. If baseline.clean is also gone, the VM is treated as unrecoverable and the reimage flow is offered.
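
The fallback order in this mitigation can be written as a small decision function. A sketch in Python, where the function name and the returned action strings are made up for illustration:

```python
def choose_load_action(last_checkpoint, existing_snapshots):
    """Pick the load path for one VM from the snapshots that still exist."""
    existing = set(existing_snapshots)
    if last_checkpoint in existing:
        return "resume"                 # normal path: continue from live disk
    if "baseline.clean" in existing:
        return "offer_baseline_resume"  # checkpoint missing, baseline intact
    return "offer_reimage"              # treated as unrecoverable
```

Both fallback branches return an offer rather than acting directly, since the game never reverts VM state without player awareness.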

Risk 2: Live Disk State is Unbootable

The player damaged the VM beyond booting — corrupted bootloader, deleted critical system files, broke networking in a way that prevents observation.

Mitigation: The game detects unbootable VMs through libvirt domain state and failed SSH probes. The player is notified in-world ("hermes is not responding") and the reimage flow is offered. The game does not attempt to force-boot or auto-repair.
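
One way the two signals could be combined, sketched in Python. Several things here are assumptions rather than the game's actual detection logic: the domain state arrives as a string like those printed by virsh domstate ("running", "crashed"), a plain TCP connect to port 22 stands in for a real SSH probe, and the threshold of three consecutive failures is arbitrary.

```python
import socket


def tcp_port_open(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def looks_unbootable(domain_state, recent_probes, min_failures=3):
    """Decide whether to notify the player and offer the reimage flow.

    domain_state:  state string for the VM (assumed format, see above)
    recent_probes: probe results, oldest first (True means reachable)
    """
    if domain_state == "crashed":
        return True
    if domain_state != "running":
        # Shut off or paused is not treated as unbootable in this sketch.
        return False
    # Running but unreachable for the last min_failures probes in a row.
    recent = recent_probes[-min_failures:]
    return len(recent) == min_failures and not any(recent)
```

The function only reports; it does not try to restart or repair the VM.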

Risk 3: Multiple VMs Diverge from Each Other

The player fixed hermes but their notes reference a service that is now configured differently. Cross-VM state is inconsistent with authored expectations.

Mitigation: World flags are the source of truth for cross-VM consequences, not raw VM state. If the flags say nginx_stable but hermes currently has nginx failed, the validation service surfaces this on next observation pass and raises an in-world event. The player is not penalized for drift that happens while they are offline — but they are informed.

Risk 4: Disk Space on Host

qcow2 images with many snapshots can balloon. Long save histories consume real host storage.

Mitigation: Managed checkpoint retention policy. The game keeps a maximum of N shift checkpoints per VM (default: 5) and prunes the oldest on new checkpoint creation. Authored baseline and recovery snapshots are never pruned by the game. A storage budget field in vm_profiles allows per-VM tuning.
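
A minimal Python sketch of the pruning rule, assuming checkpoints follow the checkpoint.shift-{N} naming used in this document. The function name is hypothetical, and it only selects names; deleting the snapshots is left to whatever libvirt wrapper the game uses.

```python
import re

CHECKPOINT_RE = re.compile(r"^checkpoint\.shift-(\d+)$")


def checkpoints_to_prune(snapshot_names, keep=5):
    """Return the oldest shift checkpoints beyond the retention limit.

    Names that do not match checkpoint.shift-{N}, such as baseline.clean
    and baseline.recovery, are never returned.
    """
    numbered = []
    for name in snapshot_names:
        match = CHECKPOINT_RE.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    # Sort by shift number so shift-10 is treated as newer than shift-9.
    numbered.sort()
    excess = len(numbered) - keep
    if excess <= 0:
        return []
    return [name for _, name in numbered[:excess]]
```

With seven checkpoints (shift-1 through shift-7) and the default limit of 5, this returns checkpoint.shift-1 and checkpoint.shift-2.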

Resource budget note:

  • Budget the workstation separately from server VMs, even when its profile is modest
  • Save/recovery tooling should assume workstation snapshots are the most storage-expensive routine snapshots in the fleet
  • Earlier lab builds showed that browser-capable workstation images can exceed small cloud-image defaults quickly; the terminal-first plan avoids much of that pressure, but disk budgets still need to be explicit

THE REIMAGE FLOW

When a VM is unrecoverable, the player can report it for reimage through an in-world mechanic (ticket to management or ops channel).

Flow:

  1. Player submits a reimage request for the affected machine
  2. An in-world delay is imposed (e.g., 1 in-game shift)
  3. The machine is restored from baseline.recovery or baseline.clean
  4. Trust penalty is applied based on severity
  5. Any in-progress quests on that VM are reset to their baseline state
  6. Evidence from before the reimage is gone — acknowledged in-world as "we had to wipe the machine"

This is not a free reset. It has visible consequences. But it allows the game to continue rather than becoming permanently stuck.

The reimage flow is the designed escape valve, not a hidden automatic recovery.
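
The bookkeeping side of the flow can be sketched against the draft save schema. This Python illustration makes several assumptions: baseline.recovery is preferred over baseline.clean when both exist, "reset to baseline state" is modelled as dropping the quest's recorded branch outcome, and the penalty amount and the quest-to-VM mapping are supplied by the caller. The function names are hypothetical.

```python
import copy


def reimage_ready_shift(current_shift, delay_shifts=1):
    """Shift at which the reimaged machine becomes available again."""
    return current_shift + delay_shifts


def apply_reimage(save, vm_name, quests_on_vm, trust_penalty, existing_snapshots):
    """Return a new save dict reflecting a completed reimage of vm_name."""
    source = next(
        (name for name in ("baseline.recovery", "baseline.clean")
         if name in existing_snapshots),
        None,
    )
    if source is None:
        raise LookupError(f"no baseline snapshot left for {vm_name}")

    updated = copy.deepcopy(save)  # leave the caller's save untouched
    vm = updated["vms"][vm_name]
    vm["current_snapshot_tier"] = source   # assumption: tier records the restore source
    vm["reimage_count"] += 1
    vm["last_observation"] = {}            # pre-reimage evidence is gone
    updated["player"]["trust"] -= trust_penalty
    for quest_id in quests_on_vm:
        if quest_id in updated["quests"]["active"]:
            updated["quests"]["branch_outcomes"].pop(quest_id, None)
    return updated
```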


SHIFT CHECKPOINTS

At the start of each in-game shift, the game:

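  1. Clears all persists: false world flags in the copy being serialized (the live WorldFlagRegistry is cleared only after the write completes)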
  2. Saves all game state JSON (with non-persistent flags already zeroed)
  3. Creates a named snapshot for each active VM: checkpoint.shift-{N}
  4. Records the checkpoint reference in the save file
  5. Prunes shift checkpoints beyond the retention limit

This gives the player a rollback option at shift granularity if they want to undo a disastrous session, at the cost of losing that shift's work entirely.

Shift checkpoint rollback is an explicit player action, not automatic. It is presented as "start this shift over" and requires confirmation. It does not undo trust changes, nor world flag consequences that other characters have already acted on (e.g., dialogue already delivered, tickets already closed).


DEVELOPER RESET

For authoring and testing, a separate CLI tool exists outside the game:

bash tools/vm/snapshot-all.sh --revert-to baseline.clean

This is not accessible in the shipped game. It completely resets all VMs to their authored baseline. Used during content authoring and automated test runs.


SAVE FILE STRUCTURE (DRAFT SCHEMA)

{
  "save_version": 1,
  "player": {
    "trust": 14,
    "trust_history": [],
    "unlocks": ["sudo:systemctl", "vm:build_machine"],
    "current_shift": 7
  },
  "world": {
    "flags": {
      "player_ssh_configured": true,
      "nginx_stable": true,
      "hermes_logrotate_healthy": false,
      "hermes_log_pressure_pending": true,
      "hermes_disk_healthy": false
    },
    "flag_history": [],
    "_note": "persists:false flags are zeroed at shift boundary before this snapshot is written. They survive game load within the same shift."
  },
  "quests": {
    "completed": ["Q001", "Q002"],
    "failed": [],
    "active": ["Q003"],
    "branch_outcomes": {
      "Q002": "config-fixed-enabled"
    }
  },
  "tickets": {
    "active": ["T003"],
    "closed": ["T001", "T002"]
  },
  "incidents": {
    "active": [
      {
        "id": "I001",
        "started_at_shift": 6,
        "escalation_step_reached": 1
      }
    ],
    "resolved": []
  },
  "vms": {
    "workstation": {
      "current_snapshot_tier": "live",
      "last_checkpoint": "checkpoint.shift-6",
      "recovery_snapshot": "baseline.recovery",
      "reimage_count": 0,
      "last_observation": {}
    },
    "web_server": {
      "current_snapshot_tier": "live",
      "last_checkpoint": "checkpoint.shift-6",
      "recovery_snapshot": "baseline.recovery",
      "reimage_count": 0,
      "last_observation": {}
    }
  }
}
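
A loader would likely want a structural check before trusting a save. The following Python sketch validates only the sections and per-VM fields shown in the draft schema above. The function name, the error strings, and the choice to report problems as a list are assumptions, and the schema itself is still a draft.

```python
import json

REQUIRED_SECTIONS = (
    "save_version", "player", "world", "quests", "tickets", "incidents", "vms",
)
REQUIRED_VM_FIELDS = (
    "current_snapshot_tier", "last_checkpoint", "recovery_snapshot",
    "reimage_count", "last_observation",
)


def validate_save(raw_text, supported_version=1):
    """Parse save text and return a list of problems (empty means usable)."""
    try:
        save = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(save, dict):
        return ["top level must be a JSON object"]

    problems = [
        f"missing section: {key}" for key in REQUIRED_SECTIONS if key not in save
    ]
    if "save_version" in save and save["save_version"] != supported_version:
        problems.append(f"unsupported save_version: {save['save_version']}")

    vms = save.get("vms", {})
    if not isinstance(vms, dict):
        problems.append("vms must be an object")
        return problems
    for vm_name, vm in vms.items():
        for field in REQUIRED_VM_FIELDS:
            if not isinstance(vm, dict) or field not in vm:
                problems.append(f"vms.{vm_name}: missing {field}")
    return problems
```

This checks shape only. Whether the referenced snapshots still exist is a separate check against libvirt.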

DESIGN PRINCIPLES SUMMARY

  • The dirty state is the game. Preserving it is the point.
  • Snapshots are structured fallbacks, not the primary save mechanism.
  • The game never silently reverts VM state without player awareness.
  • Recovery from failure is in-world and has consequences.
  • The host disk cost is real and must be managed with a retention policy.
  • Developers get clean-reset tooling outside the shipped game.
  • persists: false flags reset at shift boundary, not on load.