sysadmin-chronicles/docs/SAVE_SYSTEM.md

# SYSADMIN CHRONICLES — SAVE SYSTEM DESIGN
> Version 1.3 | Status: Active development
>
> Changelog:
>   v1.3 — Defined `persists: false` flag semantics (shift boundary reset).
>           Added world flag persistence rules section.
>
> This document covers the save model, VM persistence policy, dirty state
> handling, recovery flows, and the design decisions behind them.

---

## THE CORE TENSION

The game wants real VMs. Real VMs have real state. That state changes as the
player works. The question is: what do we save, when, and what happens when
things go wrong?

Two broad approaches exist:

**Approach A — Replay Model**
Save authored flags and game state only. On load, restore a baseline snapshot
and replay authored events to reconstruct the world. Simple, cheap, predictable.

**Approach B — Dirty State Model**
Preserve actual VM disk state as-is. Save references to the current snapshot or
live qcow2 state. On load, the VM resumes exactly where it was.

This game uses **Approach B**, with structured recovery fallbacks. Here is why,
and what that means in practice.

---

## WHY DIRTY STATE

The replay model breaks the design contract. If the player spent forty minutes
debugging a broken service, leaving behind log entries, partial edits, and
useful breadcrumbs, restoring a clean baseline erases all of that. The world
forgets. That is not how real systems work.

The dirty state model means:
- The player's workstation remembers what they did
- Target VMs remember fixes applied and mistakes made
- Evidence persists — good and bad
- A machine the player damaged stays damaged until they fix it or request a reimage
- A machine they set up correctly stays correct

Operational note:
- The workstation should be treated as a curated terminal-first appliance image
  whose shell history, local config, and jump-box state persist like any other VM state
- Desktop-like company tools live in the game state layer, not inside a VM browser session
- Rebuilding the workstation runtime on every reset would create slow, noisy,
  and inconsistent recovery behavior

This is more expensive. It is also the point of the game.

---

## WHAT GETS SAVED

### Game State Layer
Saved as structured JSON. Cheap, fast, always consistent.

- Player trust score and history
- Unlocked VMs, sudo scopes, internal docs, tools
- Active and completed ticket/quest state
- World flags (current values and change history)
- Incident scheduler state (active incidents, escalation timers)
- Per-quest authored consequence records
- Shift timestamp and in-world clock

### VM State Layer
Saved as libvirt snapshot references or qcow2 state references. Expensive but
necessary.

- Per-VM: reference to current named snapshot or live disk state
- Per-VM: list of managed recovery checkpoints
- Per-VM: reimage eligibility and reimage history
- Per-VM: last-known observation data (advisory, not authoritative)

The game does not store VM disk images in the save file. It stores references to
named snapshots managed by libvirt. The actual disk data lives where libvirt
puts it.

---

## WORLD FLAG PERSISTENCE RULES

Every world flag in `world_flags/world_flags.json` declares a `persists` field.
This controls how the flag behaves across shift boundaries and game loads.

### `persists: true`
The flag is written to the save file and survives indefinitely. It is cleared
only when a quest or incident explicitly sets it to false, or when the VM is
reimaged. Most flags are persistent — they represent stable facts about the
world (nginx is configured correctly, logrotate is healthy, etc.).

### `persists: false`
The flag is **reset at the start of each new shift**, regardless of its current
value. It is NOT reset on game load within the same shift.

Non-persistent flags represent transient pressure states that should not carry
forward into the next working session:
- `hermes_disk_healthy` — disk state that may change overnight without the player's intervention
- `web_disk_pressure_active` — active disk pressure event currently escalating

**On shift boundary**: all `persists: false` flags are cleared before the new
shift's checkpoint is taken. Their cleared state is what gets saved.

**On game load mid-shift**: `persists: false` flags are loaded from the save
file as-is. They are not reset on load, only on shift boundary.

**Implementation note for `SaveSystem`**: When writing the shift checkpoint,
iterate all world flags and zero out any with `persists: false` before
serializing. Do not zero them in the live `WorldFlagRegistry` until the
checkpoint write is complete, to avoid mid-write state corruption.
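
The ordering in the implementation note can be sketched as follows. This is a minimal illustration, not the actual `SaveSystem` code: the function names, the `write` callback, and the shape of the flag definitions (a dict with a `persists` key per flag) are assumptions, as is treating the "zeroed" value of a flag as `False`.

```python
import json
from typing import Any, Callable


def build_checkpoint_flags(
    flag_values: dict[str, bool],
    flag_defs: dict[str, dict[str, Any]],
) -> dict[str, bool]:
    """Return a copy of the flags with every persists:false flag reset to False."""
    snapshot = dict(flag_values)  # copy, so the live registry is untouched
    for name, definition in flag_defs.items():
        # Assumption: a definition without a `persists` key counts as persistent.
        if not definition.get("persists", True) and name in snapshot:
            snapshot[name] = False
    return snapshot


def write_shift_checkpoint(
    live_flags: dict[str, bool],
    flag_defs: dict[str, dict[str, Any]],
    write: Callable[[str], None],
) -> None:
    """Serialize the zeroed copy first, then clear the live flags."""
    payload = build_checkpoint_flags(live_flags, flag_defs)
    write(json.dumps({"world": {"flags": payload}}, sort_keys=True))
    # Reached only if the write did not raise, so a failed write leaves
    # the live registry exactly as it was.
    live_flags.update(payload)
```

Because the live dictionary is mutated only after `write` returns, an exception during serialization or disk I/O cannot leave the registry half-cleared.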

---

## SNAPSHOT STRATEGY FOR SAVE/LOAD

### Named Snapshot Tiers

Each VM maintains three tiers of named snapshots, plus its live disk state:

```
baseline.clean          — Authored starting state for a fresh quest arc
baseline.recovery       — Fallback if live state is unrecoverable
checkpoint.shift-{N}    — Auto-saved at start of each in-game shift
live                    — Current working state (no snapshot, just disk)
```

On save: the game records which snapshot tier is current per VM and any
divergence from it (live state is implicitly the disk, not a snapshot).

On load: the game checks that referenced snapshots still exist and are
consistent with the saved game state flags. If they are, it resumes from live
disk state and continues normally.

### What "Resume" Means

The game does not revert to a snapshot on load. It resumes from whatever state
the VMs are currently in. The save file describes what the game *thinks* the
world looks like. On load, the observation service validates current VM state
against saved world flags and reconciles any drift.

Minor drift (service restarted, log rotated by the OS) is handled silently.
Major drift (a VM that should be running is gone, a snapshot reference is
missing) triggers the recovery flow.
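
One way to picture the minor/major split is a small classifier over what the observation service reports. The `VmObservation` fields and the `Drift` enum below are hypothetical names chosen for illustration; the real observation data model is not specified in this document.

```python
from dataclasses import dataclass
from enum import Enum


class Drift(Enum):
    NONE = "none"
    MINOR = "minor"   # reconciled silently
    MAJOR = "major"   # triggers the recovery flow


@dataclass(frozen=True)
class VmObservation:
    domain_exists: bool                      # is the VM still defined and reachable?
    missing_snapshots: tuple[str, ...] = ()  # referenced snapshots that are gone
    benign_changes: tuple[str, ...] = ()     # e.g. restarted services, rotated logs


def classify_drift(obs: VmObservation) -> Drift:
    """Map one VM's observation to the drift level described above."""
    if not obs.domain_exists or obs.missing_snapshots:
        return Drift.MAJOR
    if obs.benign_changes:
        return Drift.MINOR
    return Drift.NONE
```

Major conditions are checked first, so a VM with both a rotated log and a missing snapshot is still routed to recovery.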

---

## DIRTY STATE RISKS AND MITIGATIONS

### Risk 1: Snapshot Reference Goes Stale
A named snapshot the game references is deleted or corrupted outside the game.

Mitigation: On load, the save system checks that all referenced snapshots exist
before resuming. If a checkpoint snapshot is missing but baseline.clean exists,
offer to resume from baseline with authored-flag reconstruction where possible.
If baseline.clean is also gone, the VM is treated as unrecoverable and the
reimage flow is offered.
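
The fallback order in this mitigation can be sketched as a pure decision function. The `LoadAction` names are illustrative, and the sketch assumes the caller has already obtained the set of snapshot names that currently exist for the VM (for example from libvirt); how that listing is obtained is out of scope here.

```python
from enum import Enum


class LoadAction(Enum):
    RESUME_LIVE = "resume_live"
    OFFER_BASELINE_RESUME = "offer_baseline_resume"
    OFFER_REIMAGE = "offer_reimage"


def choose_load_action(referenced: list[str], existing: set[str]) -> LoadAction:
    """Decide what to offer for one VM, given which snapshots still exist."""
    missing = [name for name in referenced if name not in existing]
    if not missing:
        return LoadAction.RESUME_LIVE
    if "baseline.clean" in existing:
        # Something is missing, but the authored baseline survives.
        return LoadAction.OFFER_BASELINE_RESUME
    # The baseline is gone too: the VM is treated as unrecoverable.
    return LoadAction.OFFER_REIMAGE
```

Both non-resume outcomes are offers presented to the player; nothing here reverts a VM on its own.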

### Risk 2: Live Disk State is Unbootable
The player damaged the VM beyond booting — corrupted bootloader, deleted
critical system files, broke networking in a way that prevents observation.

Mitigation: The game detects unbootable VMs through libvirt domain state and
failed SSH probes. The player is notified in-world ("hermes is not responding")
and the reimage flow is offered. The game does not attempt to force-boot or
auto-repair.

### Risk 3: Multiple VMs Diverge from Each Other
The player fixed hermes but their notes reference a service that is now
configured differently. Cross-VM state is inconsistent with authored
expectations.

Mitigation: World flags are the source of truth for cross-VM consequences, not
raw VM state. If the flags say `nginx_stable` but hermes currently has nginx
failed, the validation service surfaces this on next observation pass and raises
an in-world event. The player is not penalized for drift that happens while they
are offline — but they are informed.
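
A rough sketch of that comparison, assuming the observation pass can express what it saw as the same boolean flag names used in the save file. The function name and the event dictionary shape are hypothetical. The saved flags are only read, never rewritten, which keeps them the source of truth.

```python
def find_flag_drift(
    saved_flags: dict[str, bool],
    observed: dict[str, bool],
) -> list[dict]:
    """Return one in-world event per flag whose observed value disagrees."""
    events = []
    for name in sorted(observed):
        # Observations for flags the save does not know about are ignored.
        if name in saved_flags and saved_flags[name] != observed[name]:
            events.append(
                {
                    "flag": name,
                    "expected": saved_flags[name],
                    "observed": observed[name],
                    # Drift that happened while offline informs, never penalizes.
                    "penalize": False,
                }
            )
    return events
```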

### Risk 4: Disk Space on Host
qcow2 images with many snapshots can balloon. Long save histories consume real
host storage.

Mitigation: Managed checkpoint retention policy. The game keeps a maximum of N
shift checkpoints per VM (default: 5) and prunes the oldest on new checkpoint
creation. Authored baseline and recovery snapshots are never pruned by the game.
A storage budget field in vm_profiles allows per-VM tuning.
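
The retention rule can be sketched as a function that selects which snapshot names to delete. The function name is an assumption; the `checkpoint.shift-{N}` naming and the default of 5 come from this document. Only names matching the checkpoint pattern are ever candidates, so baseline snapshots are skipped by construction.

```python
import re

CHECKPOINT_RE = re.compile(r"^checkpoint\.shift-(\d+)$")


def checkpoints_to_prune(snapshot_names: list[str], keep: int = 5) -> list[str]:
    """Return the shift checkpoints to delete, oldest first."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    numbered = []
    for name in snapshot_names:
        match = CHECKPOINT_RE.match(name)
        if match:  # baseline.clean, baseline.recovery, etc. never match
            numbered.append((int(match.group(1)), name))
    numbered.sort()  # numeric order, so shift-9 sorts before shift-10
    excess = len(numbered) - keep
    return [name for _, name in numbered[:excess]] if excess > 0 else []
```

Sorting on the parsed shift number rather than the raw string avoids pruning `checkpoint.shift-10` ahead of `checkpoint.shift-9`.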

Resource budget note:
- Budget the workstation separately from server VMs, even when its profile is modest
- Save/recovery tooling should assume workstation snapshots are the most
  storage-expensive routine snapshots in the fleet
- Earlier lab builds showed that browser-capable workstation images can exceed
  small cloud-image defaults quickly; the terminal-first plan avoids much of
  that pressure, but disk budgets still need to be explicit

---

## THE REIMAGE FLOW

When a VM is unrecoverable, the player can report it for reimage through an
in-world mechanic (ticket to management or ops channel).

Flow:
1. Player submits a reimage request for the affected machine
2. An in-world delay is imposed (e.g., 1 in-game shift)
3. The machine is restored from baseline.recovery or baseline.clean
4. Trust penalty is applied based on severity
5. Any in-progress quests on that VM are reset to their baseline state
6. Evidence from before the reimage is gone — acknowledged in-world as "we
   had to wipe the machine"

This is not a free reset. It has visible consequences. But it allows the game
to continue rather than becoming permanently stuck.

The reimage flow is the designed escape valve, not a hidden automatic recovery.
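
The bookkeeping side of the flow can be sketched as a planning step that computes consequences without touching any VM. Everything named here (`ReimagePlan`, `plan_reimage`, the parameters) is hypothetical. The sketch also assumes `baseline.recovery` is preferred over `baseline.clean` when both exist, and that the severity penalty is a plain integer supplied by the caller; the document does not fix either detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReimagePlan:
    vm: str
    restore_from: str
    ready_at_shift: int
    trust_after: int
    quests_to_reset: tuple[str, ...]


def plan_reimage(
    vm: str,
    current_shift: int,
    trust: int,
    severity_penalty: int,
    available_snapshots: set[str],
    active_quests_on_vm: list[str],
    delay_shifts: int = 1,
) -> ReimagePlan:
    """Compute what a reimage request would do, without performing it."""
    for candidate in ("baseline.recovery", "baseline.clean"):
        if candidate in available_snapshots:
            restore_from = candidate
            break
    else:
        raise LookupError(f"no baseline snapshot available for {vm}")
    return ReimagePlan(
        vm=vm,
        restore_from=restore_from,
        ready_at_shift=current_shift + delay_shifts,  # the in-world delay
        trust_after=trust - severity_penalty,
        quests_to_reset=tuple(active_quests_on_vm),
    )
```

Separating the plan from its execution would let the game show the player the cost before they confirm the request.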

---

## SHIFT CHECKPOINTS

At the start of each in-game shift, the game:
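1. Zeroes all `persists: false` world flags in the state being serialized (the live `WorldFlagRegistry` is cleared only after the write completes)
2. Saves all game state JSON (with non-persistent flags already zeroed)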
3. Creates a named snapshot for each active VM: `checkpoint.shift-{N}`
4. Records the checkpoint reference in the save file
5. Prunes shift checkpoints beyond the retention limit

This gives the player a rollback option at shift granularity if they want to
undo a disastrous session, at the cost of losing that shift's work entirely.

Shift checkpoint rollback is an explicit player action, not automatic. It is
presented as "start this shift over" and requires confirmation. It does not
undo trust changes or world flag consequences that have already reached other
characters (e.g., dialogue already delivered, tickets already closed).
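
A partial sketch of the game-state side of a rollback, assuming both states follow the draft schema's `player` and `tickets` sections. The function name is hypothetical. It carries forward trust and closed tickets only; which world flags count as already delivered to other characters is not specified in this document, so the sketch restores flags from the checkpoint wholesale and leaves that rule out.

```python
import copy


def rollback_to_checkpoint(current: dict, checkpoint: dict) -> dict:
    """Return the checkpoint state with non-undoable consequences carried over."""
    restored = copy.deepcopy(checkpoint)  # never mutate the stored checkpoint
    # Trust changes earned or lost during the abandoned shift are kept.
    restored["player"]["trust"] = current["player"]["trust"]
    restored["player"]["trust_history"] = list(current["player"]["trust_history"])
    # Tickets closed during the abandoned shift stay closed.
    closed = list(current["tickets"]["closed"])
    restored["tickets"]["closed"] = closed
    restored["tickets"]["active"] = [
        ticket for ticket in restored["tickets"]["active"] if ticket not in closed
    ]
    return restored
```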

---

## DEVELOPER RESET

For authoring and testing, a separate CLI tool exists outside the game:

```bash
bash tools/vm/snapshot-all.sh --revert-to baseline.clean
```

This is not accessible in the shipped game. It completely resets all VMs to
their authored baseline. Used during content authoring and automated test runs.

---

## SAVE FILE STRUCTURE (DRAFT SCHEMA)

```json
{
  "save_version": 1,
  "player": {
    "trust": 14,
    "trust_history": [],
    "unlocks": ["sudo:systemctl", "vm:build_machine"],
    "current_shift": 7
  },
  "world": {
    "flags": {
      "player_ssh_configured": true,
      "nginx_stable": true,
      "hermes_logrotate_healthy": false,
      "hermes_log_pressure_pending": true,
      "hermes_disk_healthy": false
    },
    "flag_history": [],
    "_note": "persists:false flags are zeroed at shift boundary before the shift checkpoint is written. They survive game load within the same shift."
  },
  "quests": {
    "completed": ["Q001", "Q002"],
    "failed": [],
    "active": ["Q003"],
    "branch_outcomes": {
      "Q002": "config-fixed-enabled"
    }
  },
  "tickets": {
    "active": ["T003"],
    "closed": ["T001", "T002"]
  },
  "incidents": {
    "active": [
      {
        "id": "I001",
        "started_at_shift": 6,
        "escalation_step_reached": 1
      }
    ],
    "resolved": []
  },
  "vms": {
    "workstation": {
      "current_snapshot_tier": "live",
      "last_checkpoint": "checkpoint.shift-6",
      "recovery_snapshot": "baseline.recovery",
      "reimage_count": 0,
      "last_observation": {}
    },
    "web_server": {
      "current_snapshot_tier": "live",
      "last_checkpoint": "checkpoint.shift-6",
      "recovery_snapshot": "baseline.recovery",
      "reimage_count": 0,
      "last_observation": {}
    }
  }
}
```
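
A loader would likely want a structural check before trusting a save file. The following is a minimal sketch against the draft schema above, not a full validator: the function name and the rule that only `save_version` 1 is accepted are assumptions, and the schema itself is still marked as a draft.

```python
import json

SUPPORTED_SAVE_VERSION = 1
REQUIRED_SECTIONS = ("player", "world", "quests", "tickets", "incidents", "vms")


def validate_save(raw: str) -> list[str]:
    """Return a list of problems found in a save file; empty means it looks sane."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]

    problems = []
    if data.get("save_version") != SUPPORTED_SAVE_VERSION:
        problems.append(f"unsupported save_version: {data.get('save_version')!r}")
    for section in REQUIRED_SECTIONS:
        if not isinstance(data.get(section), dict):
            problems.append(f"missing or malformed section: {section}")
    vms = data.get("vms")
    if isinstance(vms, dict):
        for name, record in vms.items():
            if not isinstance(record, dict) or "current_snapshot_tier" not in record:
                problems.append(f"vm {name}: missing current_snapshot_tier")
    return problems
```

Returning a list instead of raising on the first error lets the game report every structural problem in one pass.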

---

## DESIGN PRINCIPLES SUMMARY

- The dirty state is the game. Preserving it is the point.
- Snapshots are structured fallbacks, not the primary save mechanism.
- The game never silently reverts VM state without player awareness.
- Recovery from failure is in-world and has consequences.
- The host disk cost is real and must be managed with a retention policy.
- Developers get clean-reset tooling outside the shipped game.
- `persists: false` flags reset at shift boundary, not on load.