Backup Flow

The full backup flow is coordinated by BackupManager, but data ownership stays inside each module.

BackupManager does not read module storage directly. It stores backup configuration, starts runs, writes manifests, validates produced artifacts, classifies run status, and exposes run history through TCP methods. Modules register backup callbacks and are responsible for flushing their own pending state before producing a dump.
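
The ownership split described above can be illustrated with a minimal registry sketch. The names BackupRegistry, register, and run_all are hypothetical and are not taken from the server code; the sketch only assumes that each module hands the manager a callback that writes its own artifacts into the run directory, and that callbacks run in name order.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical callback shape: receives the run directory and returns
# the artifact paths the module wrote there.
BackupCallback = Callable[[Path], List[Path]]


class BackupRegistry:
    """Illustrative registry; the manager never touches module storage."""

    def __init__(self) -> None:
        self._participants: Dict[str, BackupCallback] = {}

    def register(self, name: str, callback: BackupCallback) -> None:
        if name in self._participants:
            raise ValueError(f"participant already registered: {name}")
        self._participants[name] = callback

    def run_all(self, run_dir: Path) -> Dict[str, List[Path]]:
        # Sequential execution in deterministic name order.
        results: Dict[str, List[Path]] = {}
        for name in sorted(self._participants):
            results[name] = self._participants[name](run_dir)
        return results
```

In this sketch the module decides what flushing means and the manager only sees the resulting files.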

Lifecycle

  1. Modules initialize and register backup participants through the backup registry.
  2. BackupManager loads backup configuration and existing run manifests during server initialization.
  3. If stale directories exist under backups/.running, they are treated as failed runs and moved to backups/failed.
  4. A full run starts from cron or from StartFullBackup.
  5. The manager creates backups/.running/<run_id> and writes the initial manifest.json with status running.
  6. Participants are executed sequentially in deterministic name order.
  7. Each participant flushes or drains its own cache/queue state and writes one or more artifacts into the run directory.
  8. The manager validates every artifact: file exists, file is not empty, SHA-256 is calculated, and storage integrity checks pass where applicable.
  9. The manifest is updated after every participant.
  10. The manager classifies the run as completed, partial, or failed.
  11. The final run directory is published under backups/completed/<run_id> or backups/failed/<run_id>.
  12. Validated artifacts and the manifest are published to enabled remote backup endpoints.
  13. Run history is available through GetBackupRuns and GetBackupRun.

During the run, the manager also updates a runtime progress snapshot. Clients can read it with GetBackupProgress or subscribe to the backup.progress channel for stream updates.
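
Step 3 of the lifecycle (stale run recovery) might look like the sketch below. The function name recover_stale_runs is hypothetical, and the sketch only moves directories; any manifest status update the real manager performs is not shown.

```python
import shutil
from pathlib import Path
from typing import List


def recover_stale_runs(backups_root: Path) -> List[str]:
    """Move leftover run directories from .running to failed.

    Returns the run ids that were moved, in name order.
    """
    running = backups_root / ".running"
    failed = backups_root / "failed"
    moved: List[str] = []
    if not running.is_dir():
        return moved
    failed.mkdir(parents=True, exist_ok=True)
    for run_dir in sorted(p for p in running.iterdir() if p.is_dir()):
        # A directory still under .running at startup means the previous
        # process stopped mid-run, so the run is treated as failed.
        shutil.move(str(run_dir), str(failed / run_dir.name))
        moved.append(run_dir.name)
    return moved
```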

Manifest

Every run writes manifest.json.

{
  "run_id": "20260629-121045-000001",
  "type": "full",
  "backup_format_version": 1,
  "trigger": "manual",
  "status": "completed",
  "started_at": 1782727845,
  "finished_at": 1782727852,
  "path": "backups/completed/20260629-121045-000001",
  "server_version": "2.0.0",
  "error_summary": "",
  "requested_participants": [],
  "participants": [
    {
      "name": "accounts",
      "critical": true,
      "status": "completed",
      "started_at": 1782727845,
      "finished_at": 1782727846,
      "error": "",
      "artifacts": [
        {
          "logical_name": "accounts.snapshot",
          "file": "accounts.snapshot",
          "path": "backups/completed/20260629-121045-000001/accounts.snapshot",
          "size_bytes": 327680,
          "sha256": "f2a1...",
          "integrity_ok": true,
          "integrity_error": "",
          "validation_ok": true,
          "validation_error": ""
        }
      ]
    }
  ]
}

The TCP run response uses the same data model, except that an artifact's filename is conveyed through path and logical_name rather than through a separate file field.
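
A client can scan a manifest for artifacts that did not pass validation. The helper below is a sketch: failed_artifacts is a hypothetical name, it relies only on the fields shown in the example manifest, and it treats a missing flag as a failure.

```python
from typing import List


def failed_artifacts(manifest: dict) -> List[str]:
    """Return 'participant/logical_name' for every artifact that is not fully OK."""
    failed: List[str] = []
    for participant in manifest.get("participants", []):
        for artifact in participant.get("artifacts", []):
            ok = artifact.get("validation_ok") and artifact.get("integrity_ok")
            if not ok:
                failed.append(f'{participant["name"]}/{artifact["logical_name"]}')
    return failed
```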

Critical Participants

Participants can be critical or optional. Current production participants are all critical:

Participant  Artifact           Critical
accounts     accounts.snapshot  Yes
symbols      symbols.snapshot   Yes
staff        staff.snapshot     Yes
trades       trades.snapshot    Yes

If a critical participant fails, the run status becomes failed. If only optional participants fail, the run status becomes partial.
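
The classification rule can be written as a small function. This is a sketch of the rule as stated here, not the manager's code; classify_run and its input shape are hypothetical.

```python
from typing import Iterable, Tuple


def classify_run(results: Iterable[Tuple[bool, bool]]) -> str:
    """Classify a run from (critical, succeeded) pairs, one per participant."""
    status = "completed"
    for critical, succeeded in results:
        if succeeded:
            continue
        if critical:
            # Any failed critical participant fails the whole run.
            return "failed"
        # Only optional participants have failed so far.
        status = "partial"
    return status
```

For example, classify_run([(True, True), (False, False)]) returns "partial", while any (True, False) entry yields "failed".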

Scoped Backups

StartFullBackup accepts an optional participants array. When omitted, the run is a full backup and executes all currently registered participants. When provided, the run is scoped and only the requested participants are executed.

Example scoped request:

{
  "command": "StartFullBackup",
  "extID": "1",
  "data": {
    "participants": ["trades"]
  }
}

A scoped run writes type: "scoped" and requested_participants in the run response and manifest. Scoped backups are intended for module-level operational workflows, such as taking a trades snapshot before import or before a module-scoped restore. They are not full disaster restore points.
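
A client-side helper for building this request could look like the following sketch. The function name is hypothetical, and sending an empty data object for a full backup is an assumption; the text only states that participants may be omitted.

```python
import json
from typing import Optional, Sequence


def start_full_backup_request(
    ext_id: str, participants: Optional[Sequence[str]] = None
) -> str:
    """Serialize a StartFullBackup command; omit participants for a full run."""
    message = {"command": "StartFullBackup", "extID": ext_id, "data": {}}
    if participants is not None:
        # A participants array makes the run scoped.
        message["data"]["participants"] = list(participants)
    return json.dumps(message)
```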

Validation

For every artifact, the manager stores:

Field             Description
size_bytes        Artifact size in bytes
sha256            SHA-256 checksum
integrity_ok      Participant-specific integrity check result
integrity_error   Integrity check error text
validation_ok     Overall artifact validation result
validation_error  File, checksum, or validation error text

Large runtime stores are copied in bounded chunks. This avoids copying a whole multi-gigabyte artifact in one step and allows the process to release internal cache and allocator memory after the dump and validation phases.
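
The bounded-chunk approach can be sketched as a copy loop that hashes each chunk as it is written, so memory use stays near the chunk size regardless of artifact size. The function name, the default chunk size, and the error strings are assumptions made for illustration; the participant-specific integrity check is not modeled.

```python
import hashlib
from pathlib import Path


def copy_and_validate(src: Path, dst: Path, chunk_size: int = 4 * 1024 * 1024) -> dict:
    """Copy src to dst in bounded chunks and return manifest-style fields."""
    record = {"size_bytes": 0, "sha256": "", "validation_ok": False, "validation_error": ""}
    if not src.is_file():
        record["validation_error"] = "source file does not exist"
        return record
    digest = hashlib.sha256()
    with src.open("rb") as reader, dst.open("wb") as writer:
        while True:
            chunk = reader.read(chunk_size)  # at most chunk_size bytes in memory
            if not chunk:
                break
            writer.write(chunk)
            digest.update(chunk)
            record["size_bytes"] += len(chunk)
    if record["size_bytes"] == 0:
        record["validation_error"] = "artifact is empty"
        return record
    record["sha256"] = digest.hexdigest()
    record["validation_ok"] = True
    return record
```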

Runtime Progress

Backup progress is designed for long-running dumps and UI progress bars.

The progress phases are:

Phase           Description
idle            No active run has published progress yet
backup          A participant is creating artifacts
validate        The manager is validating participant artifacts
finalize        The local run is being classified and published
publish_remote  Validated artifacts are being sent to configured backup servers
done            Backup finished successfully or partially
failed          Backup failed

Use:

  • GetBackupProgress for polling.
  • Subscribe with chanels: ["backup.progress"] for push updates. The current TCP API field name is chanels; channels is not accepted.

Use a separate TCP/WS session for progress while another session waits for the synchronous StartFullBackup command. The stream payload uses the same fields as GetBackupProgress, including overall_percent, participants_done, participants_total, remote_done, and remote_total.
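
A progress client might build the subscription and render snapshots as in the sketch below. Only the chanels spelling, the backup.progress channel name, and the five counter fields come from the text above. The Subscribe command name, the request envelope (assumed to mirror the StartFullBackup example), and the phase field name are assumptions.

```python
import json
from typing import Mapping

TERMINAL_PHASES = {"done", "failed"}


def subscribe_request(ext_id: str) -> str:
    # Note the API spelling: "chanels", not "channels".
    # The "Subscribe" command name and envelope are assumed, not documented here.
    message = {"command": "Subscribe", "extID": ext_id,
               "data": {"chanels": ["backup.progress"]}}
    return json.dumps(message)


def format_progress(snapshot: Mapping) -> str:
    """Render one progress payload as a single status line."""
    phase = snapshot.get("phase", "idle")  # field name "phase" is an assumption
    return (
        f"[{phase}] {snapshot.get('overall_percent', 0)}% "
        f"participants {snapshot.get('participants_done', 0)}/"
        f"{snapshot.get('participants_total', 0)} "
        f"remote {snapshot.get('remote_done', 0)}/{snapshot.get('remote_total', 0)}"
    )


def is_terminal(snapshot: Mapping) -> bool:
    """True once the run has reached done or failed."""
    return snapshot.get("phase") in TERMINAL_PHASES
```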

Restore Flow

Restore is participant-based, like backup. BackupManager validates the selected backup run first, then calls the restore callbacks registered by each module.

Current restore flow:

  1. Client calls ValidateBackupRun to check manifest and artifacts.
  2. Client calls StartRestoreRun with run_id, optional dry_run, and optional participants.
  3. BackupManager blocks concurrent backup/restore operations.
  4. For actual restore, BackupManager requires maintenance/update mode unless force is explicitly set.
  5. The source run is validated again before restore starts. Only completed backup runs are restorable; partial, failed, running, cancelled, and timeout runs are rejected.
  6. For actual restore, the manager creates a pre-restore safety snapshot under backups/pre-restore/<restore_id>.
  7. If any selected critical participant cannot produce a valid safety snapshot, restore does not start.
  8. Registered participants are executed in deterministic name order.
  9. Each participant must restore its own data and rebuild or refresh its own runtime state.
  10. If a participant does not provide a restore callback, the restore run is marked failed.
  11. Restore progress and final status are written to backups/restores/<restore_id>/manifest.json.

BackupManager must not directly replace module storage files because modules own their writers, queues, caches, and domain invariants.

Current restore callback coverage:

Participant  Artifact           Restore state
staff        staff.snapshot     Implemented through module-controlled restore and cache refresh
symbols      symbols.snapshot   Implemented through module-controlled restore and cache refresh
accounts     accounts.snapshot  Implemented through runtime drain, module-controlled restore, and cache refresh
trades       trades.snapshot    Implemented through runtime drain, module-controlled restore, and cache refresh

Use participants when only part of the backup should be restored, for example ["staff"], ["symbols"], ["accounts"], or a combination of participants. Omitting participants attempts full restore for all currently registered participants.

dry_run is allowed in any server mode. Actual restore requires liveupdate_mode != 0, or an explicit force: true request for controlled maintenance operations.
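
The gating rules can be summarized as a pre-flight check. This is a sketch: the function name and messages are hypothetical, and it assumes the completed-only rule applies to dry runs as well as to actual restores.

```python
def check_restore_allowed(
    run_status: str,
    dry_run: bool = False,
    liveupdate_mode: int = 0,
    force: bool = False,
) -> str:
    """Return an empty string if the restore may start, else a reason."""
    if run_status != "completed":
        # partial, failed, running, cancelled, and timeout runs are rejected.
        return f"run status '{run_status}' is not restorable"
    if dry_run:
        return ""  # dry_run is allowed in any server mode
    if liveupdate_mode != 0 or force:
        return ""
    return "actual restore requires liveupdate_mode != 0 or force: true"
```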

Restore history is available through GetRestoreRuns and GetRestoreRun. The restore response and manifest include pre_restore_path; participant results include safety_snapshot_path and safety_artifacts.

Remote Restore Flow

Remote backup servers are used as offsite storage. Restore still runs through the local backup catalog:

  1. Call GetRemoteBackupRuns for the configured endpoint.
  2. Select a remote run_id.
  3. Call ImportRemoteBackupRun to download the manifest and artifacts into backups/completed/<run_id>.
  4. Call ValidateBackupRun.
  5. Call StartRestoreRun.
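
Once a remote run_id has been chosen, steps 3 to 5 reduce to an ordered command sequence, sketched below. The command names, run_id, and dry_run come from this page. The endpoint key, the run_id key on ImportRemoteBackupRun and ValidateBackupRun, and the helper name are assumptions.

```python
from typing import List, Tuple


def remote_restore_steps(
    run_id: str, endpoint: str, dry_run: bool = True
) -> List[Tuple[str, dict]]:
    """Return (command, data) pairs to send in order; stop at the first failure."""
    return [
        # "endpoint" is a hypothetical parameter name.
        ("ImportRemoteBackupRun", {"endpoint": endpoint, "run_id": run_id}),
        ("ValidateBackupRun", {"run_id": run_id}),
        ("StartRestoreRun", {"run_id": run_id, "dry_run": dry_run}),
    ]
```

Defaulting dry_run to True in the sketch keeps an accidental call from starting an actual restore.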

Operational Notes

  • StartFullBackup is synchronous in the current TCP handler: the response is returned after the run finishes or fails.
  • Backup progress is available during the synchronous call through GetBackupProgress and backup.progress; use a separate TCP/WS client for progress while the caller waits for StartFullBackup.
  • After a successful run, backup.progress emits the final done state, and later GetBackupProgress polls return the idle state.
  • Only one full backup run can execute at a time.
  • StartRestoreRun is also synchronous in the current TCP handler.
  • StopBackupServer is currently a compatibility method and returns HTTP 501.
  • Additional participant coverage, retention, runtime cancellation, and object-level fast backups are future architecture work.