# SYSADMIN CHRONICLES — SAVE SYSTEM DESIGN
> Version 1.3 | Status: Active development
>
> Changelog:
> v1.3: Defined `persists: false` flag semantics (shift boundary reset).
> Added world flag persistence rules section.
>
> This document covers the save model, VM persistence policy, dirty state
> handling, recovery flows, and the design decisions behind them.

---

## THE CORE TENSION

The game wants real VMs. Real VMs have real state. That state changes as the
player works. The question is: what do we save, when, and what happens when
things go wrong?

Two broad approaches exist:

**Approach A: Replay Model**
Save authored flags and game state only. On load, restore a baseline snapshot
and replay authored events to reconstruct the world. Simple, cheap, predictable.

**Approach B: Dirty State Model**
Preserve actual VM disk state as-is. Save references to the current snapshot or
live qcow2 state. On load, the VM resumes exactly where it was.

This game uses **Approach B**, with structured recovery fallbacks. Here is why,
and what that means in practice.

---

## WHY DIRTY STATE

The replay model breaks the design contract. If the player spent forty minutes
debugging a broken service, leaving behind log entries, partial edits, and
useful breadcrumbs, restoring a clean baseline erases all of that. The world
forgets. That is not how real systems work.

The dirty state model means:
- The player's workstation remembers what they did
- Target VMs remember fixes applied and mistakes made
- Evidence persists, both good and bad
- A machine the player damaged stays damaged until they fix it or request reimage
- A machine they set up correctly stays correct

Operational note:
- The workstation should be treated as a curated terminal-first appliance image
  whose shell history, local config, and jump-box state persist like any other VM state
- Desktop-like company tools live in the game state layer, not inside a VM browser session
- Rebuilding the workstation runtime on every reset would create slow, noisy,
  and inconsistent recovery behavior

This is more expensive. It is also the point of the game.

---

## WHAT GETS SAVED

### Game State Layer
Saved as structured JSON. Cheap, fast, always consistent.

- Player trust score and history
- Unlocked VMs, sudo scopes, internal docs, tools
- Active and completed ticket/quest state
- World flags (current values and change history)
- Incident scheduler state (active incidents, escalation timers)
- Per-quest authored consequence records
- Shift timestamp and in-world clock

### VM State Layer
Saved as libvirt snapshot references or qcow2 state references. Expensive but
necessary.

- Per-VM: reference to current named snapshot or live disk state
- Per-VM: list of managed recovery checkpoints
- Per-VM: reimage eligibility and reimage history
- Per-VM: last-known observation data (advisory, not authoritative)

The game does not store VM disk images in the save file. It stores references to
named snapshots managed by libvirt. The actual disk data lives where libvirt
puts it.

---

## WORLD FLAG PERSISTENCE RULES

Every world flag in `world_flags/world_flags.json` declares a `persists` field.
This controls how the flag behaves across shift boundaries and game loads.

### `persists: true`
The flag is written to the save file and survives indefinitely. It is cleared
only when a quest or incident explicitly sets it to false, or when the VM is
reimaged. Most flags are persistent: they represent stable facts about the
world (nginx is configured correctly, logrotate is healthy, etc.).

### `persists: false`
The flag is **reset at the start of each new shift**, regardless of its current
value. It is NOT reset on game load within the same shift.

Non-persistent flags represent transient pressure states that should not carry
forward into the next working session:
- `hermes_disk_healthy`: disk state that may change overnight without the player's intervention
- `web_disk_pressure_active`: active disk pressure event currently escalating

**On shift boundary**: all `persists: false` flags are cleared before the new
shift's checkpoint is taken. Their cleared state is what gets saved.

**On game load mid-shift**: `persists: false` flags are loaded from the save
file as-is. They are not reset on load, only on shift boundary.

**Implementation note for `SaveSystem`**: When writing the shift checkpoint,
iterate all world flags and zero out any with `persists: false` before
serializing. Do not zero them in the live `WorldFlagRegistry` until the
checkpoint write is complete, to avoid mid-write state corruption.

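
The implementation note above can be sketched roughly as follows. This is a minimal illustration, not the actual `SaveSystem` code: the function names, the plain-dict stand-in for `WorldFlagRegistry`, the `write` callable, and the assumption that "zeroing" a boolean flag means setting it to `False` are all hypothetical.

```python
import json


def build_checkpoint_flags(live_flags, flag_defs):
    """Return a copy of the flags with every persists:false flag cleared.

    live_flags: dict of flag name -> bool (stand-in for WorldFlagRegistry).
    flag_defs:  dict of flag name -> definition containing a 'persists' field.
    """
    snapshot = dict(live_flags)  # work on a copy, never on the live registry
    for name, definition in flag_defs.items():
        # Assumption: a flag with no 'persists' field is treated as persistent.
        if not definition.get("persists", True) and name in snapshot:
            snapshot[name] = False  # assumption: "zeroed" means False
    return snapshot


def write_shift_checkpoint(live_flags, flag_defs, write):
    """Serialize the cleared copy first, then clear the live registry."""
    snapshot = build_checkpoint_flags(live_flags, flag_defs)
    write(json.dumps({"world": {"flags": snapshot}}, sort_keys=True))
    # Reached only if write() did not raise, so a failed write leaves the
    # live registry untouched.
    live_flags.update(snapshot)
    return snapshot
```

If `write` raises, the live flags keep their pre-checkpoint values, which is the property the note asks for.
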

---

## SNAPSHOT STRATEGY FOR SAVE/LOAD

### Named Snapshot Tiers

Each VM maintains three named snapshot tiers, plus the live disk state:

```
baseline.clean         Authored starting state for a fresh quest arc
baseline.recovery      Fallback if live state is unrecoverable
checkpoint.shift-{N}   Auto-saved at start of each in-game shift
live                   Current working state (no snapshot, just disk)
```

On save: the game records which snapshot tier is current per VM and any
divergence from it (live state is implicitly the disk, not a snapshot).

On load: the game checks that referenced snapshots still exist and are
consistent with the saved game state flags. If they are, it resumes from live
disk state and continues normally.

### What "Resume" Means

The game does not revert to a snapshot on load. It resumes from whatever state
the VMs are currently in. The save file describes what the game *thinks* the
world looks like. On load, the observation service validates current VM state
against saved world flags and reconciles any drift.

Minor drift (service restarted, log rotated by the OS) is handled silently.
Major drift (a VM that should be running is gone, a snapshot reference is
missing) triggers the recovery flow.

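
As a rough sketch of the major-drift check described above, the function below collects the reasons a VM should enter the recovery flow. The function name and the shape of `observed` (`domain_exists`, `snapshots`) are assumptions made for illustration; only `last_checkpoint` and `recovery_snapshot` come from the draft save schema. Minor drift is out of scope here and is left to the observation service.

```python
def find_major_drift(saved_vm, observed):
    """Return a list of reasons to trigger recovery; empty means resume.

    saved_vm: one entry from save["vms"], e.g. {"last_checkpoint": ...}.
    observed: hypothetical probe result with keys
              'domain_exists' (bool) and 'snapshots' (set of names).
    """
    problems = []
    if not observed.get("domain_exists", False):
        problems.append("vm missing")
    existing = observed.get("snapshots", set())
    for key in ("last_checkpoint", "recovery_snapshot"):
        name = saved_vm.get(key)
        if name and name not in existing:
            problems.append(f"missing snapshot: {name}")
    return problems
```

An empty list means the load can continue from live disk state; anything else is handed to the recovery flow rather than repaired silently.
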

---

## DIRTY STATE RISKS AND MITIGATIONS

### Risk 1: Snapshot Reference Goes Stale
A named snapshot the game references is deleted or corrupted outside the game.

Mitigation: On load, the save system checks all referenced snapshots exist
before resuming. If a checkpoint snapshot is missing but `baseline.clean` exists,
offer to resume from baseline with authored-flag reconstruction where possible.
If `baseline.clean` is also gone, the VM is treated as unrecoverable and the
reimage flow is offered.

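
The fallback order in this mitigation can be written as a small decision function. This is a sketch under assumptions: the function name and the returned action strings are hypothetical, and how the set of existing snapshot names is obtained is left to the caller.

```python
def resolve_stale_reference(checkpoint_name, existing_snapshots):
    """Pick the load-time action for one VM's checkpoint reference.

    existing_snapshots: set of snapshot names currently present for the VM.
    Returns one of three hypothetical action labels.
    """
    if checkpoint_name in existing_snapshots:
        return "resume"  # reference is intact, continue from live disk
    if "baseline.clean" in existing_snapshots:
        return "offer_baseline_resume"  # player chooses, flags reconstructed
    return "offer_reimage"  # VM is treated as unrecoverable
```

Both fallbacks are offers to the player, so the function only names the action and does not perform it.
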
### Risk 2: Live Disk State is Unbootable
The player damaged the VM beyond booting: corrupted bootloader, deleted
critical system files, broke networking in a way that prevents observation.

Mitigation: The game detects unbootable VMs through libvirt domain state and
failed SSH probes. The player is notified in-world ("hermes is not responding")
and the reimage flow is offered. The game does not attempt to force-boot or
auto-repair.

### Risk 3: Multiple VMs Diverge from Each Other
The player fixed hermes but their notes reference a service that is now
configured differently. Cross-VM state is inconsistent with authored
expectations.

Mitigation: World flags are the source of truth for cross-VM consequences, not
raw VM state. If the flags say `nginx_stable` but hermes currently has nginx
failed, the validation service surfaces this on next observation pass and raises
an in-world event. The player is not penalized for drift that happens while they
are offline, but they are informed.

### Risk 4: Disk Space on Host
qcow2 images with many snapshots can balloon. Long save histories consume real
host storage.

Mitigation: Managed checkpoint retention policy. The game keeps a maximum of N
shift checkpoints per VM (default: 5) and prunes the oldest on new checkpoint
creation. Authored baseline and recovery snapshots are never pruned by the game.
A storage budget field in `vm_profiles` allows per-VM tuning.

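
A minimal sketch of the retention rule, assuming checkpoint names follow the `checkpoint.shift-{N}` pattern with a numeric `N`. The function name and the `max_checkpoints` parameter are hypothetical; the default of 5 is the one stated above.

```python
import re

# Only names of this exact shape are candidates for pruning.
_SHIFT_RE = re.compile(r"^checkpoint\.shift-(\d+)$")


def checkpoints_to_prune(snapshot_names, max_checkpoints=5):
    """Return the oldest shift checkpoints that exceed the retention limit.

    Baseline and recovery snapshots never match the pattern, so they can
    never be returned for pruning.
    """
    shifts = []
    for name in snapshot_names:
        match = _SHIFT_RE.match(name)
        if match:
            shifts.append((int(match.group(1)), name))
    shifts.sort()  # numeric order, so shift-10 sorts after shift-9
    excess = len(shifts) - max_checkpoints
    return [name for _, name in shifts[:excess]] if excess > 0 else []
```

Sorting on the parsed integer rather than the string avoids pruning `checkpoint.shift-10` before `checkpoint.shift-9`.
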
Resource budget note:
- Budget the workstation separately from server VMs, even when its profile is modest
- Save/recovery tooling should assume workstation snapshots are the most
  storage-expensive routine snapshots in the fleet
- Earlier lab builds showed that browser-capable workstation images can exceed
  small cloud-image defaults quickly; the terminal-first plan avoids much of
  that pressure, but disk budgets still need to be explicit

---

## THE REIMAGE FLOW

When a VM is unrecoverable, the player can report it for reimage through an
in-world mechanic (ticket to management or ops channel).

Flow:
1. Player submits a reimage request for the affected machine
2. An in-world delay is imposed (e.g., 1 in-game shift)
3. The machine is restored from `baseline.recovery` or `baseline.clean`
4. Trust penalty is applied based on severity
5. Any in-progress quests on that VM are reset to their baseline state
6. Evidence from before the reimage is gone, which is acknowledged in-world as
   "we had to wipe the machine"

This is not a free reset. It has visible consequences. But it allows the game
to continue rather than becoming permanently stuck.

The reimage flow is the designed escape valve, not a hidden automatic recovery.

---

## SHIFT CHECKPOINTS

At the start of each in-game shift, the game:
1. Clears all `persists: false` world flags
2. Saves all game state JSON (with non-persistent flags already zeroed)
3. Creates a named snapshot for each active VM: `checkpoint.shift-{N}`
4. Records the checkpoint reference in the save file
5. Prunes shift checkpoints beyond the retention limit

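
Steps 3 and 4 above might look like the sketch below. It is illustrative only: the function name is hypothetical, `create_snapshot` is an assumed callable that wraps whatever libvirt call the game uses, and the shift number is passed in explicitly because this document does not pin down which shift `N` refers to. The `vms` and `last_checkpoint` keys follow the draft save schema.

```python
def take_shift_checkpoint(save, active_vms, shift_number, create_snapshot):
    """Create checkpoint.shift-{N} per active VM and record it in the save.

    save:            dict shaped like the draft save schema.
    create_snapshot: hypothetical callable (vm_name, snapshot_name) -> None
                     that raises on failure.
    """
    name = f"checkpoint.shift-{shift_number}"
    for vm in active_vms:
        create_snapshot(vm, name)
        # Record the reference only after the snapshot was actually created,
        # so the save never points at a checkpoint that does not exist.
        save["vms"][vm]["last_checkpoint"] = name
    return name
```

Recording the reference after creation keeps a failed snapshot from producing a stale reference of the kind described under Risk 1.
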
This gives the player a rollback option at shift granularity if they want to
undo a disastrous session, at the cost of losing that shift's work entirely.

Shift checkpoint rollback is an explicit player action, not automatic. It is
presented as "start this shift over" and requires confirmation. It does not
undo trust changes or world flag consequences that have already reached other
characters (e.g., dialogue already delivered, tickets already closed).

---

## DEVELOPER RESET

For authoring and testing, a separate CLI tool exists outside the game:

```bash
bash tools/vm/snapshot-all.sh --revert-to baseline.clean
```

This is not accessible in the shipped game. It completely resets all VMs to
their authored baseline. It is used during content authoring and automated test runs.

---

## SAVE FILE STRUCTURE (DRAFT SCHEMA)

```json
{
  "save_version": 1,
  "player": {
    "trust": 14,
    "trust_history": [],
    "unlocks": ["sudo:systemctl", "vm:build_machine"],
    "current_shift": 7
  },
  "world": {
    "flags": {
      "player_ssh_configured": true,
      "nginx_stable": true,
      "hermes_logrotate_healthy": false,
      "hermes_log_pressure_pending": true,
      "hermes_disk_healthy": false
    },
    "flag_history": [],
    "_note": "persists:false flags are zeroed at shift boundary before this snapshot is written. They survive game load within the same shift."
  },
  "quests": {
    "completed": ["Q001", "Q002"],
    "failed": [],
    "active": ["Q003"],
    "branch_outcomes": {
      "Q002": "config-fixed-enabled"
    }
  },
  "tickets": {
    "active": ["T003"],
    "closed": ["T001", "T002"]
  },
  "incidents": {
    "active": [
      {
        "id": "I001",
        "started_at_shift": 6,
        "escalation_step_reached": 1
      }
    ],
    "resolved": []
  },
  "vms": {
    "workstation": {
      "current_snapshot_tier": "live",
      "last_checkpoint": "checkpoint.shift-6",
      "recovery_snapshot": "baseline.recovery",
      "reimage_count": 0,
      "last_observation": {}
    },
    "web_server": {
      "current_snapshot_tier": "live",
      "last_checkpoint": "checkpoint.shift-6",
      "recovery_snapshot": "baseline.recovery",
      "reimage_count": 0,
      "last_observation": {}
    }
  }
}
```

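
A loader for this draft schema could start with a structural check like the one below. This is a sketch, not the game's loader: the function name, the choice of `ValueError`, and the decision to treat every top-level section as required are assumptions; the key names are taken from the schema above.

```python
import json

SUPPORTED_VERSION = 1
# Assumption: every top-level section in the draft schema is required.
REQUIRED_SECTIONS = ("player", "world", "quests", "tickets", "incidents", "vms")
REQUIRED_VM_KEYS = ("current_snapshot_tier", "last_checkpoint", "recovery_snapshot")


def load_save(text):
    """Parse a save file and reject it early if its shape is unexpected."""
    save = json.loads(text)
    version = save.get("save_version")
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported save_version: {version!r}")
    missing = [key for key in REQUIRED_SECTIONS if key not in save]
    if missing:
        raise ValueError("missing sections: " + ", ".join(missing))
    for vm_name, entry in save["vms"].items():
        absent = [key for key in REQUIRED_VM_KEYS if key not in entry]
        if absent:
            raise ValueError(f"vm {vm_name!r} missing: " + ", ".join(absent))
    return save
```

Failing before any VM is touched keeps a malformed save from being mistaken for drift on the VM side.
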

---

## DESIGN PRINCIPLES SUMMARY

- The dirty state is the game. Preserving it is the point.
- Snapshots are structured fallbacks, not the primary save mechanism.
- The game never silently reverts VM state without player awareness.
- Recovery from failure is in-world and has consequences.
- The host disk cost is real and must be managed with a retention policy.
- Developers get clean-reset tooling outside the shipped game.
- `persists: false` flags reset at shift boundary, not on load.